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1.0  SUMMARY 


Probabilistic  modeling  and  machine  learning  have  proven  to  be  powerful  tools  in  many  defense, 
industrial,  and  scientific  computing  applications.  Unfortunately,  their  continuing  adoption  has 
been  hindered  because  engineering  with  them  requires  PhD-level  expertise.  Building  these 
systems  involves  simultaneously  developing  mathematical  models,  inference  algorithms  and 
optimized  software  implementations.  Small  changes  to  the  underlying  modeling  assumptions, 
data,  or  compute  budget  often  require  end-to-end  redesigns  and  re-implementations  that  can  lead 
to  surprising  changes  to  system  accuracy  and  performance. 

DARPA’s  Probabilistic  Programming  for  Advancing  Machine  Learning  (PPAML)  program 
aimed  to  address  these  problems.  Specific  goals  identified  by  the  program  were  (i)  reducing  the 
lines  of  code  required  to  build  state-of-the-art  machine  learning  systems;  (ii)  making  machine 
learning  and  data  science  capabilities  accessible  to  a  broader  class  of  programmers;  (iii)  making 
it  possible  to  deploy  rich  generative  models  to  solve  applied  problems,  and  thereby  reduce  the 
amount  of  training  data  required,  and  (iv)  identifying  interfaces  and  abstractions  that  unify 
probabilistic  programming  languages  and  enable  multiple  inference  strategies  or  "solvers"  to 
interoperate.  The  program  also  aimed  to  create  the  technical  foundations  for  new  collaborations 
between  the  machine  learning  and  programming  language  research  communities,  so  that  future 
progress  could  be  made  by  incremental  research  and  engineering. 

Over  the  course  of  the  program,  we  succeeded  in  achieving  these  ambitious  goals.  Specific 
highlights  include: 

1.  We  developed  Venture,  a  probabilistic  programming  system  with  support  for  custom 
inference  strategies.  We  showed  that  Venture  makes  it  possible  to  reimplement  and 
extend  the  Automatic  Statistician,  a  state-of-the-art  machine  learning  system,  in  just  70 
lines  of  code,  or  roughly  ~50X  fewer  lines  of  code  than  the  original  implementation. 

2.  We  developed  BayesDB,  which  makes  it  possible  for  ordinary  developers  to  solve  hard 
data  science  and  machine  learning  problems  using  a  simple,  SQL-like  language. 

BayesDB  was  evaluated  by  DARPA  during  PPAML  via  two  hackathons,  showing  that  it 
is  possible  to  use  probabilistic  programming  to  solve  data  science  problems  with  an 
accuracy  that  exceeds  solutions  produced  by  expert  statistical  consultants.  BayesDB  has 
yielded  multiple  industry  and  government  partnerships  and  been  adopted  as  a  component 
of  the  DARPA  Synergistic  Discovery  and  Design  (SD2)  program. 

3.  We  developed  Picture,  which  makes  it  possible  to  write  ~50  line  programs  that  use 
generative  modeling  to  solve  hard  3D  computer  vision  problems.  These  programs  require 
no  training  data  at  all,  but  match  or  exceed  the  accuracy  of  state-of-the-art  baselines  that 
are  based  on  machine  learning  from  training  data.  Picture  won  a  best  paper  (honorable 
mention)  award  at  CVPR  2015,  the  top  international  computer  vision  conference. 

4.  We  developed  Metaprob,  a  probabilistic  meta-programming  language.  In  Metaprob,  both 
probabilistic  models  and  custom  inference  strategies  can  be  written  as  user-space  code  in 
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a  single  language,  using  novel  reflective  programming  constructs.  It  is  thus  sufficiently 
expressive  that  Venture,  Picture,  and  BayesDB  —  as  well  as  earlier  languages  such  as 
Church  —  could  (in  principle)  all  be  re-implemented  within  it.  It  also  addresses  a  core 
intellectual  challenge  posed  by  Dr.  Kathleen  Fisher,  the  original  PPAML  program 
manager:  how  to  integrate  approaches  to  inference  in  probabilistic  programs. 

Our  PPAML  research  generated  significant  interest  and  investment  from  industry.  For  example, 
it  led  to  invited  briefings  for  the  CEOs  of  Intel  Corporation  (given  by  PI  Vikash  Mansinghka) 
and  Microsoft  Corporation  (given  by  Co-PI  Josh  Tenenbaum).  It  also  led  to  the  formation  of 
Empirical  Systems,  a  VC-backed  startup  based  on  BayesDB,  and  the  deployment  of  the 
BayesDB  open-source  prototype  at  JP  Morgan.  Engineers  at  other  companies,  including 
Zymergen,  Inc.,  have  used  BayesDB  to  detect  predictive  relationships  between  variables. 

Our  work  on  PPAML  has  also  begun  to  catalyze  the  creation  of  an  academic  community  in 
probabilistic  programming.  For  example.  Pis  Mansinghka  and  Tenenbaum  gave  an  invited 
tutorial  at  NIPS  2017  on  probabilistic  programming,  with  over  2000  attendees.  Mansinghka  also 
gave  invited  tutorials  at  multiple  O'Reilly  AI  conferences,  with  hundreds  of  participants.  This 
work  also  led  to  the  first  full-semester  graduate  course  in  probabilistic  programming,  taught  at 
MIT  by  PI  Vikash  Mansinghka,  and  a  book  contract  for  an  introductory  book  on  Probabilistic 
Programming  with  the  MIT  Press.  Preparation  for  all  these  tutorials,  courses,  and  teaching 
material  was  supported  by  PPAML. 

This  report  is  structured  as  follows.  The  next  section  gives  an  introduction  to  probabilistic 
programming  and  a  brief  overview  of  the  main  probabilistic  languages  we  developed  over  the 
course  of  the  program.  The  methods  section  outlines  the  research  approach.  The  results  and 
discussion  section  gives  representative  quantitative  and  qualitative  results  showcasing 
capabilities  of  PPAML  technology.  Finally,  the  conclusion  outlines  recommendations  that  could 
guide  future  research  investments  in  probabilistic  programming  and  highlights  steps  being  taken 
to  transition  probabilistic  programming  to  problems  of  pressing  interest  to  the  US  government, 
especially  the  Department  of  Defense. 
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2.0  INTRODUCTION^ 


Probabilistic  modeling  and  machine  learning  have  proven  to  be  powerful  tools  in  many  defense, 
industrial,  and  scientific  computing  applications.  Unfortunately,  their  continuing  adoption  has 
been  hindered  because  engineering  with  them  requires  PhD-level  expertise.  Building  these 
systems  involves  simultaneously  developing  mathematical  models,  inference  algorithms  and 
optimized  software  implementations.  Small  changes  to  the  underlying  modeling  assumptions, 
data,  or  compute  budget  often  require  end-to-end  redesigns  and  reimplementations  that  can  lead 
to  surprising  changes  to  system  accuracy  and  performance. 

Probabilistic  programming  is  an  emerging  field  at  the  intersection  of  artificial  intelligence  and 
programming  languages,  that  aims  to  address  these  problems.  By  encapsulating  and  automating 
inference,  learning  and  prediction,  probabilistic  programming  provides  a  promising  path 
forwards,  and  has  the  potential  to  radically  simplify  the  process  of  building  systems  that  use 
probabilistic  representations  of  knowledge  to  interpret  empirical  data.  But  after  DARPA's 
investment  in  PPAML,  what  is  probabilistic  programming  known  to  be  useful  for,  in  practice? 
and  what  are  the  fundamental  technical  problems  that  organize  research  in  the  field? 

Probabilistic  programming  is  already  influencing  research  in  several  other  fields.  The  most 
visible  successes  are  currently  in  Bayesian  data  analysis,  where  probabilistic  programming  has 
made  state-of-the-art  hierarchical  modeling  and  Monte  Carlo  inference  techniques  accessible  to  a 
broad  audience.  For  example,  the  Stan  language  for  hierarchical  Bayesian  modeling  has  been 
adopted  by  a  community  of  thousands  of  scientists  and  statisticians.  Probabilistic  programming 
languages  also  make  it  possible  to  rapidly  prototype  systems  based  on  inference  in  probabilistic 
models  that  would  otherwise  be  prohibitively  complex.  Example  applications  include  detecting 
and  localizing  nuclear  explosions  on  the  earth's  surface,  inferring  3D  object  geometry  from 
single  images,  and  determining  the  probable  lineages  of  malware.  Probabilistic  programming 
languages  have  been  used  to  reimplement  state-of-the-art  techniques  for  Al-assisted  data  analysis 
in  ~50x  fewer  lines  of  code  than  is  possible  in  P5dhon  or  MATLAB.  Finally,  probabilistic 
programs  are  the  basis  of  new  models  of  human-level  concept  learning  and  reasoning.  For 
example,  a  recent  model  of  how  people  can  learn  handwritten  characters  from  just  one  training 
example  is  formulated  in  terms  of  probabilistic  program  induction,  i.e.  learning  probabilistic 
programs  from  data. 

Probabilistic  programming  is  based  on  two  technical  ideas: 

Probabilistic  models  can  be  represented  using  programs.  These  programs  can  generate 
samples  from  the  probability  distribution  associated  with  the  model,  or  calculate  the  probability 
density  of  values  sampled  from  the  distribution. 

Operations  on  models  such  as  inference  and  learning  can  be  implemented  as  meta¬ 
programs.  These  meta-programs  produce,  consume,  execute,  transform,  or  modify  other 
probabilistic  programs.  Their  inputs  and  outputs  are  other  probabilistic  programs,  represented  as 


^  This  section  is  derived  from  Mansinghka  et  al.,  2015;  Kulkarni  et  al.,  2015;  Mansinghka, 
Selsam,  and  Perov,  2014. 
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executables  and/or  source  code,  as  well  as  representations  of  other  probabilistic  programs' 
runtime  behavior. 

These  two  ideas  lead  to  a  broad  design  space  of  languages  and  engineering  approaches  supported 
by  those  languages.  Some  probabilistic  languages  are  embedded  in  widely-used  programming 
languages;  for  example,  Figaro  is  embedded  in  Scala,  WebPPL  is  embedded  in  JavaScript, 
Anglican  is  embedded  in  Clojure,  and  Picture  is  embedded  in  Julia.  Other  probabilistic  languages 
are  stand-alone;  prominent  examples  include  BLOG,  Stan,  and  Venture.  Some  languages,  such 
as  Church,  BLOG,  and  Stan,  are  based  on  black-box  inference  engines,  while  others,  such  as 
Venture,  Picture,  and  Edward,  provide  constructs  that  let  users  specify  custom  inference 
strategies.  Some  languages  and  implementations  are  closely  designed  around  the  needs  of  a 
specific  class  of  applications,  while  others  aim  to  be  more  general-purpose.  Some  languages 
have  explored  the  value  of  restricting  expressivity  to  ensure  that  specific  inference  strategies 
apply  to  all  programs.  For  example,  Infer.NET  does  not  support  models  defined  in  terms  of 
recursive  processes,  which  enables  static  compilation  into  graphical  models.  No  one  approach 
appears  to  be  best  suited  for  all  problem  domains. 

Attempts  to  formalize  these  ideas  have  also  yielded  new  theorems  that  connect  fundamental 
concepts  in  probability  theory  with  theoretical  computer  science.  For  example,  what  are  the 
limits  of  mechanizing  probabilistic  inference?  Ackerman,  Freer,  and  Roy  (2010)  recently  showed 
that  it  is  possible  to  construct  a  pair  of  computable  random  variables  (X,  Y)  in  the  unit  interval 
whose  conditional  distribution  P[YIX]  encodes  the  halting  problem.  However,  they  also  showed 
that  inference  can  be  mechanized  as  long  as  measurements  are  known  to  be  sufficiently  noisy. 
These  theorems  show  that  studying  probabilistic  objects  and  operations  in  computational  terms 
can  lead  to  new  insights.  They  also  provide  important  constraints  on  the  design  of  probabilistic 
programming  languages. 

At  the  start  of  the  PPAML  program,  the  available  languages  and  systems  were  largely  unsuitable 
for  use  by  practitioners.  Examples  of  drawbacks  include  inadequate  and  unpredictable  runtime 
performance,  limited  expressiveness,  batch-only  operation,  lack  of  extensibility,  and  reliance  on 
opaque  approximation  techniques  for  inference.  The  first  aim  of  our  research  in  PPAML  was  to 
develop  technology  that  enabled  users  to  reliably  write,  test,  debug,  and  optimize  probabilistic 
programs  and  reason  about  their  behavior,  without  requiring  detailed  knowledge  about  how 
inference  is  implemented.  A  second  aim  of  our  research  was  to  demonstrate  the  effectiveness  of 
probabilistic  programming  on  a  broad  class  of  important  problems. 

We  initially  proposed  to  focus  on  developing  Venture,  an  interactive  platform  for  general- 
purpose  probabilistic  programming  that  delivers  practical  and  predictable  inference  performance. 
However,  over  the  course  of  the  program  it  became  clear  different  programming  languages  were 
needed  to  target  different  domains,  specially  the  "data  science"  use  cases  emphasized  by 
DARPA's  PPAML  challenge  problems,  and  problems  of  visual  scene  understanding  that  are  of 
broad  interest  to  the  defense  community.  It  also  became  clear  that  there  were  fundamental 
opportunities  to  do  research  in  probabilistic  meta-programming,  motivated  by  capability 
limitations  in  Venture  that  were  highlighted  by  DARPA's  challenge  problems.  This  report 
describes  both  the  research  we  initially  proposed  and  the  evolution  of  the  research  over  the 
course  of  the  program.  It  thus  outlines  the  development  of  Venture,  but  also  BayesDB,  Picture, 
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and  MetaProb.  Together,  these  languages  informed  our  solutions  to  DARPA's  challenge 
problems,  to  the  teaching  of  probabilistic  programming  at  industry  conferences,  and  to  the 
solution  of  the  team  challenge  problems  we  described  in  our  initial  proposal. 

2.1  An  example  of  probabilistic  programming 


Consider  the  problem  of  inferring  curves  from  noisy  data.  The  figure  below  gives  an  example  of 
this  problem,  along  with  example  results  from  three  probabilistic  models  that  each  attempt  to 
solve  the  problem. 


Four  data  sets 


Inferences  using  model 
without  outliers 


Inferences  using  model 
with  fixed  hyperparameters 


Inferences  using  model 
with  hyperparameter  uncertainty 


-5  0  5 


Figure  1.  The  problem  of  inferring  curves  from  data  while  simultaneously  choosing  the  functional  form  of  the  curve  and  rejecting 
outliers.  The  top  row  shows  example  datasets,  while  the  next  three  rows  show  inference  results  from  three  probabilistic  models. 
Inference  results  include  the  data  (with  dots  shaded  according  to  the  probability  with  which  they  are  an  outlier)  and  randomly 
sampled  curves  that  explain  the  data  (drawn  in  grey).  The  first  model  has  no  random  variables  encoding  outliers.  The  second 
model  supports  outliers  but  makes  fixed  assumptions  about  a  broad  class  of  model  hyperparameters.  The  third  model  (bottom 
row)  allows  those  hyperparameters  to  be  inferred  from  the  data.  It  is  the  only  model  that  can  robustly  solve  all  of  the  example 
problems. 
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for  i  =  1 . . . 


Figure  2.  An  example  probabilistic  model  for  solving  the  problem  of  inferring  curves  that  explain  data  and  rejecting  outliers.  The 
left  column  shows  the  random  variables  in  standard  mathematical  notation.  The  right  column  shows  an  equivalent  probabilistic 
graphical  model. 


This  type  of  probabilistic  model  includes  several  characteristics  that  make  it  difficult  to  solve. 
First,  the  effective  number  of  parameters  in  the  model  is  itself  a  random  variable.  Second,  there 
is  one  binary  variable  for  each  datapoint,  determining  whether  or  not  the  datapoint  is  modeled  by 
the  underlying  curve  or  is  treated  as  an  outlier. 

However,  this  type  of  model  is  straightforward  to  represent  as  a  probabilistic  program. 
Probabilistic  programs  represent  models  using  code  that  makes  stochastic  choices  for  each  latent 
variable  in  the  model.  This  approach  allows  powerful  programming  constructs  to  be  used  to 
define  probabilistic  modeling  assumptions.  For  example,  conditional  statements  can  be  used  to 
encode  choices  over  model  structure.  The  next  figure  gives  an  example  probabilistic  program 
encoding  this  model,  along  with  an  example  execution  trace  of  this  program. 
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probabilistic  function  model(x: :Vector{Float64}) 


#  prior  over  degree  of  polynomial 
degree_prior  -  [0.25,  0.25,  0.25,  0.25] 

#  generate  degree  (either  1,  2,  3,  or  4) 

degree  ■  @choice( categorical (degreejjrior),  "degree") 

#  generate  parameters 

parameters  *  Vector{Float64}(degree+l) 
for  k=l:(degree+l) 

parameters[k]  *  @choice(normal(prior_mean,  prior_std),  "theta-$k") 
end 

tt  generate  data 
y  =  Vector{Float64}(length(x)) 
for  i»l:length(x) 
if  degree  «  1 

y_mean  -  dot( parameters,  [1.,  x[i]]) 
elseif  degree  »=  2 

y_mean  ■  dot( parameters,  [1.,  x[i],  x[i]''2]) 
elseif  degree  ==  3 

y_mean  -  dot( parameters,  [1.,  x[i],  x[i]''2,  x[i]''3]) 

else 

y_mean  -  dot(parameters,  [1.,  x[i],  x[i]''2,  x[i]''3,  x[i]M]) 
end 

is_outlier  =  phoice(f lip(prob_outlier),  "outlier-$i") 
noise  =  is_outlier  ?  outlier_noise  :  inlier_noise 
y[i]  =  @choice(normal(y_mean,  noise),  "y-$i") 
end 
end 


"degree" 

1 

"theta-1" 

1.20 

"theta-2" 

-0.20 

"outlier- 1" 

false 

"outlier- 2" 

false 

"outlier- 3" 

false 

"outlier-4" 

false 

"y-1" 

-0.22 

"y-2" 

0.10 

"y-3" 

-0.70 

"y-4" 

1.60 

One  possible  execution  trace 
of  the  program 


As  a  probabilistic  program 


with  input  x  =  [-3,  0j  2,  3] 
and  output y  =  [-0.22,  0.1,  -0.70,  1.60] 


Figure  3.  Probabilistic  program  source  code  for  solving  the  same  problem  (left),  along  with  an  example  execution  trace  of  this 
probabilistic  program  (right). 
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•  Inlier 

-  Mean  of  y 
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-  Inlier  variability  (+/-  2a) 

Figure  4.  Additional  example  execution  traces  of  the  probabilistic  program  from  the  previous  figure. 
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@probabilistic  function  model(x: :\/ector{Float64}) 

#  prior  over  degree  of  polynomial 
degree_prior  =  [0.25,  0.25,  0.25,  0.25] 

#  generate  degree  (either  1,  2,  3,  or  4) 

degree  =  @choice(categorical(degree_prior),  "degree") 

#  generate  parameters 

parameters  =  \/ector{Float64}(degree+l) 
for  k=l: (degree+1) 

parameters[k]  =  @choice(normal(0,  1),  "theta-$k") 

end 


#  hyperparameters 

inlier_noise  =  @choice(gamma(2. ,  1.),  "inlier-noise") 
outlier_noise  =  @choice(gamma(10. ,  1.),  "outlier-noise") 
prob_outlier  =  @choice(beta(l.,  20.),  "prob-outlier") 


#  generate  data 
y  =  \/ector{Float64}(length(x) ) 
for  i=l: length(x) 
if  degree  ==  1 

y_mean  =  dot (parameters,  [1.,  x[i]]) 
elseif  degree  ==  2 

y_mean  =  dot  (parameters,  [1.,  x[i],  x[i]''2]) 
elseif  degree  ==  3 

y_mean  =  dot  (parameters,  [1.,  x[i],  x[i]''2,  x[i]''3]) 

else 

y_mean  =  dot  (parameters,  [1.,  x[i],  x[i]''2,  x[i]''3,  x[i]''4]) 

end 

is_outlier  =  @choice(flip(prob_outlier) ,  "outlier-$i") 
noise  =  is_outlier  ?  outlier_noise  :  inlier_noise 
y[i]  =  @choice(normal(y_mean,  noise),  "y-$i") 
end 
end 


Figure  5.  Probabilistic  source  code  for  an  extended  model  that  includes  stochastic  choices  for  the  model  hyper-parameters. 
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(trace,  weight)  =  query(program,  args,  observations) 


X  ^  V  a  y 

Distribution  on  traces  induced  by  executing  program  p(x;  V,  a) 

(e.g.  the  prior) 

Distribution  on  traces  conditioned  on  observations  P(x|yi  (X  p{x.',  'P ,  a)  IliGy  Vi) 

(e.g.  the  posterior) 

Distribution  on  traces  sampled  during  query  execution  g(x;  P,  a,  y)  ~  j>(x|y;  P,  a) 

(e.g.  the  posterior  approximation) 

Figure  6.  An  overview  of  the  interface  to  inference  in  probabilistic  programs. 


Probabilistic  programming  languages  provide  mechanisms  for  inferring  the  probable  execution 
traces  of  programs  given  data.  The  data  provides  constraints  on  the  ehoiees  in  the  execution 
traces  of  the  probabilistic  code.  The  above  figure  gives  a  schematic  description  of  this  notion  of 
inference. 


observations  =  Trace() 
observations["y-2"]  =  0.0 

(trace,  weight)  =  query(model,  ([-3,  0,  2,  3],),  observations) 


Figure  7.  An  example  of  querying  the  probabilistic  program  shown  earlier  given  a  single  observed  datapoint.  Given  this  one  data 
point,  a  broad  class  of  curves  of  very  different  shapes  are  all  probable. 
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observations  =  Trace() 
observations["y-l"]  =  -3.0 
observations["y-2"]  =  0.0 
observations["y-3"]  =  2.0 
observations["y-4"]  =  3.0 

(trace,  weight)  =  query(model,  ([-3,  0,  2,  3],),  observations) 


-4  -2  0  2  4 


Figure  8.  Querying  a  probabilistic  program  given  several  observed  datapoints.  Note  that  the  generated  curves  all  roughly  align 
with  the  observed  data. 


2.2  Venture 

Venture  descends  from  Church,  one  of  the  first  Turing-universal,  higher-order  probabilistic 
programming  languages  —  introduced  and  developed  by  a  research  group  which  included 
several  members  of  our  team.  Venture  retains  the  expressiveness  of  the  Church  language,  while 
adding  interactivity,  greater  extensibility,  and  vastly  improved  efficiency,  scalability,  and 
predictability. 

The  specific  focus  of  our  work  on  Venture  was  supporting  the  use  of  custom  inference  strategies. 
Custom  inference  strategies  to  solve  probabilistic  modeling  and  inference  problems  is  now 
mainstream  practice  across  a  broad  range  of  fields.  Current  probabilistic  programming 
languages,  however,  typically  provide  only  a  specific  black-box  inference  strategy  or  set  of 
strategies.  Over  the  course  of  the  PPAML  program,  we  developed  novel  probabilistic 
programming  constructs  for  custom  inference  strategies.  Below,  we  describe  results  obtained  by 
implementing  these  strategies  in  the  Venture  probabilistic  programming  language.  These 
constructs  enable  developers  to  identify  inference  sub-problems,  then  apply  an  appropriately 
chosen  inference  tactic  to  each  identified  sub-problem.  Experimental  results  on  a  collection  of 
probabilistic  programs,  written  in  the  Venture  probabilistic  programming  language,  highlight  the 
significant  performance  and  accuracy  benefits  that  custom  inference  strategies  can  deliver. 
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Our  goals  with  Venture  over  the  course  of  the  program  were  to: 

1.  Reduce  the  development  time  and  code  complexity  needed  to  build  a  wide  range  of  state-of- 
the-art  machine  learning  and  probabilistic  modeling  systems. 

2.  Enable  domain  experts  to  solve  typical  data  analysis  problems  in  their  domains  using  short 
probabilistic  scripts  and  custom  variations  on  standard  modeling  templates,  without  deep 
technical  expertise  in  machine  learning  or  probabilistic  inference. 

3.  Enable  probabilistic  programming  experts  to  prototype  and  incrementally  optimize  machine 
learning  systems  that  are  currently  infeasible  to  build  at  all. 

In  doing  this,  we  have  substantially  advanced  the  technical  foundations  of  probabilistic 
programming  and  produced  an  open-source  system  that  demonstrates  the  utility  of  these 
advances.  Venture  is  now  significantly  more  efficient  and  extensible  than  previous 
implementations,  and  Venture  programs  are  simpler  to  understand,  test,  debug  and  optimize.  We 
have  improved  Venture’s  scalability,  inference  latency,  and  usability. 


It  is  easy  to  find  examples  that  demonstrate  that  custom  inference  is  important  in  practice.  For 
example,  the  following  figure  shows  that  even  on  the  simple  problem  of  inferring  curves  from 
data,  with  outliers,  the  choice  of  inference  strategy  matters.  Different  inference  strategies  (left) 
yield  significantly  different  inferences  (right).  In  these  figures,  the  inferred  curves  are  shown  in 
blue,  real  data  in  green,  and  true  outliers  in  red.  Note  that  for  all  but  the  strategy  on  the  top  left, 
the  curves  do  not  consistently  line  up  with  the  real  data. 
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(a)  LBFGS  &  Gibbs 


(b)  SSMH 


(c)  HMC  &  Gibbs 


(d)  Ordered  MH  &  Gibl>s 


Figure  9.  Choosing  the  right  inference  strategy  for  the  problem  (Mansinghka,  Selsam  and  Perov,  2014[1];  Mansinghka, 
Schaechtle,  Radul,  Handa,  Rinard,  in  review). 
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2.3  BayesDB 

Another  probabilistic  programming  language  we  developed  over  the  course  of  PPAML  was 
BayesDB.  BayesDB  was  developed  in  response  to  the  demand  for  applications  to  data  science 
problems  and  performed  well  in  multiple  DARPA  evaluations  during  PPAML.  BayesDB  is  a 
platform  for  Al-assisted  data  science  that  enables  domain  experts  to  answer  questions  in  seconds 
or  minutes  that  otherwise  require  hours  or  days  of  work  by  someone  with  good  statistical 
judgment. 

Why  is  this  important?  Stakeholders  in  business,  humanitarian  work,  science,  and  government 
are  increasingly  recognizing  the  importance  of  making  statistical  inferences  from  their  data. 
Existing  approaches  to  this  problem  require  experts  in  statistical  modeling  or  data  scientists 
proficient  in  applied  machine  learning.  These  skills  are  projected  to  be  in  short  supply  as  the 
importance  of  statistical  inference  is  increasingly  recognized  across  a  variety  of  fields.  Also, 
developers  new  to  machine  learning  may  be  stymied  by  the  maze  that  is  the  current  machine 
learning  toolkit.  This  toolkit  can  come  up  short  in  settings  that  don’t  match  canonical  machine 
learning  problems. 

How  does  BayesDB  solve  these  problems?  First,  BayesDB  provides  a  simple,  SQL-like  query 
language  for  asking  data  science  queries.  This  language  can  be  used  for  data  search,  inferential 
statistics,  and  predictive  modeling  applications.  Second,  BayesDB  provides  AI  assistance  for 
exploratory  data  analysis  and  baseline  statistical  modeling  via  CrossCat,  a  probabilistic  method 
that  emulates  many  of  the  judgment  calls  ordinarily  made  by  a  human  data  analyst.  This  AI 
assistance  enables  domain  experts  who  lack  training  in  statistics  to  draw  rigorous  inferences 
from  messy  real-world  databases.  Third,  BayesDB  provides  an  advanced  “meta-modeling” 
language  for  customizing  the  AFs  modeling  assumptions  to  incorporate  both  quantitative  and 
qualitative  domain  knowledge.  This  enables  expert  statisticians  to  improve  inference  quality  in 
risk-sensitive  applications. 

Example  questions  that  the  BayesDB  open-source  prototype  has  been  used  to  answer  include  the 
following: 

1.  Probabilistic  search:  query  by  example  (“find  me  the  10  colleges  most  like  MIT  and 
Harvard  with  regards  to  graduates’  median  income,  but  admissions  rates  over  I0%”),  and 
query  by  probability  (“find  me  colleges  with  the  most  unexpectedly  high  mean  income 
for  graduates”). 

2.  AI  assessment  of  data  quality:  “find  me  radiation  measurements  from  the  Safecast.org 
project  that  are  most  likely  to  be  errors  and/or  legitimate  anomalies.” 

3.  Virtual  experiments:  “generate  some  EEG  measurements  we  might  expect  for  a  child  in 
Bangladesh  at  age  3  years  of  age,  given  all  the  other  EEG  measurements  we  have 
observed  so  far.” 

4.  Al-assisted  inferential  statistics:  “what  genetic  markers,  if  any,  predict  increased  risk  of 
suicide  given  a  PTSD  diagnosis?  and  how  confident  can  we  be  in  the  amount  of  increase, 
given  uncertainty  due  to  statistical  sampling  and  the  large  number  of  possible  alternative 
explanations?” 
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Machine  learning: 

Math  is  not  enough;  need  domain  knowledge. 
Difficulty  with  low  N,  incomplete  data. 

Traditional  research: 

Difficulty  with  high  N,  high  D,  incomplete  data. 
Problems  with  stale  data  and  models. 


Substantive  "Danger  Zone!”: 

Expertise  Methodological  errors  can  yield  false  results 


Figure  10.  Data  science  is  difficult  -  Adapted  from  (Drew  Conway,  2013).  This  figure  summarizes  the  problems  that  BayesDB  is 
designed  to  solve. 


BayesDB  can  enable  people  in  the  "danger  zone"  to  rely  on  AI  assistance  in  lieu  of  math  & 
statistics  knowledge.  BayesDB's  probabilistic  programming  language  for  Bayesian  model 
discovery  also  addresses  technical  limitations  of  standard  techniques  in  machine  learning  and 
statistics.  It  thus  has  the  potential  to  be  useful  to  people  in  the  "machine  learning"  and 
"traditional  research"  sections  of  the  diagram. 


Goal:  Enable  domain  experts  to  solve  problems  in  seconds  to  hours  that 
currently  take  hours  to  weeks  for  someone  with  expertise  in  statistics 


Figure  11.  BayesDB:  probabilistic  programming  with  Bayesian  model  discovery.  BayesDB  is  applicable  to  data  science  problems 
that  range  along  the  spectrum  of  model  detail,  from  descriptive  and  exploratory  analysis  to  predictive,  causal,  and  mechanistic 
modeling. 


Approved  for  Public  Release;  Distribution  Unlimited. 

13 


Questions 

1  EEG:  How  is  EEG  different  in  an 
autistic  child  vs.  a  control? 

P  Deioitte:  Which  companies  are  have 
similar  bankruptcy  risks? 

^  Pharmacogenomics:  Can  we  predict  adverse 
events  based  on  genetic  data? 

4  Satellites:  What  country  probably  owns  an 
unidentified  satellite? 

5  Safecast:  Which  radioactivity  measurements 
are  unusual? 


Figure  12.  Example  datasets  to  which  BayesDB  has  been  applied. 


Key  technical  challenges: 

1.  Bayesian  model  discovery  to  learn 
baseline  models  from  sparse,  incomplete 
data 

Mansinghka  et  al.  (JMLR  2016,  CogSci  2006) 

2.  Composing  empirical  models  with  custom 
ML  algorithms  and  probabilistic  programs 

Saad  &  Mansinghka  (NIPS  2016) 

3.  Querying  models  using  a  simple  SQL-like 
language 

Mansingkha  et  al.  (arXiv  2016) 


Figure  13.  An  overview  of  the  architecture  of  BayesDB. 


Building  BayesDB  required  solving  technical  challenges  such  as  (i)  automatically  discovering 
models  from  sparse,  incomplete  data;  (ii)  combining  empirical  models  with  custom  ML 
algorithms  and  probabilistic  programs;  and  (iii)  providing  a  simple,  SQL-like  language  by  which 
end  users  can  query  these  models. 
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Columns:  votes  on  ~1000  bills  in  the  111th  US  Senate  Session 


Raw  voting 
data 


BayesDB 
Model  #1 
(of  100) 


BayesDB 
Model  #2 
(of  100) 


Rows: 

senators 


Partisan  issues 


Cluster  1: 
democrat 

Cluster  2: 
republican 


Partisan  issues 


Issues  w/  bipartisan 
agreement;  all 
senators  in  the 
same  cluster 


Special  interests, 
with  voting  blocks 


Cluster  1: 
democrat 

Cluster  2: 
republican 


Figure  14.  An  example  of  Bayesian  model  discovery  in  BayesDB  applied  to  a  database  of  Senate  voting  records. 


BayesDB  can  discover  models  from  a  broad  class  of  real  world  databases.  The  above  figure 
shows  an  example  of  the  structure  of  the  models  discovered  by  BayesDB  from  voting  records  for 
the  111th  US  Senate  (2009).  Some  senators  are  highlighted  in  colors  based  on  their  generally 
accepted  identification  as  democrats  (blue),  moderates  (purple),  or  republicans  (red).  The  middle 
row  shows  a  simple  or  low  resolution  posterior  sample  that  divides  bills  into  those  that  exhibit 
partisan  alignment  and  those  with  bipartisan  agreement.  Clusters  of  senators,  generated 
automatically,  are  separated  by  thick  black  horizontal  lines.  The  bottom  row  shows  a  sample  that 
includes  additional  views  and  clusters,  positing  a  finer-grained  predictive  model  for  votes  that 
are  treated  as  random  in  the  middle  row.  (Mansinghka  et  al.,  2016). 
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Figure  15.  A  mathematical  description  non-parametric  Bayesian  prior  used  for  Bayesian  model  discovery  in  BayesDB.  This 
probabilistic  generative  model  induces  a  prior  distribution  over  a  hypothesis  space  of  probabilistic  programs  that  model 
multivariate  data.  (Mansinghka  et  al.,  2016) 
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Figure  16.  Example  queries  in  BayesDB  against  two  probabilistic  programs  that  can  be  generated  by  the  non-parametric  Bayesian 
prior  from  the  previous  figure. 


From  a  technical  perspective,  BayesDB  consists  of  four  components,  integrated  into  a  single 
probabilistic  programming  system: 

1.  The  Bayesian  Query  Language  (BQL),  an  SQL-like  query  language  for  Bayesian  data 
analysis.  BQL  programs  can  solve  a  broad  class  of  data  analysis  problems  using 
statistically  rigorous  formulations  of  cleaning,  exploration,  confirmatory  analysis,  and 
predictive  modeling.  BQL  defines  these  primitive  operations  for  these  workflows  in 
terms  of  Bayesian  model  averaging  over  results  from  an  implicit  set  of  multivariate 
probabilistic  models. 

2.  A  mathematical  interface  that  enables  a  broad  class  of  multivariate  probabilistic 
models,  called  generative  population  models,  to  be  used  to  implement  BQL.  According  to 
this  interface,  a  data  generating  process  defined  over  a  fixed  set  of  variables  is 
represented  by  (i)  an  infinite  array  of  random  realizations  of  the  process,  including  any 
observed  data,  and  (ii)  algorithms  for  simulating  from  arbitrary  conditional  distributions 
and  calculating  arbitrary  conditional  densities.  This  interface  permits  many  statistical 
operations  to  be  implemented  once,  independent  of  the  specific  models  that  will  be  used 
to  apply  these  operations  in  the  context  of  a  particular  data  table. 

3.  The  BayesDB  Meta-modeling  Language  (MML),  a  minimal  probabilistic 
programming  language.  MML  includes  constructs  that  enable  statisticians  to  integrate 
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custom  statistical  models  —  including  arbitrary  algorithmic  models  contained  in  external 
software  —  with  the  output  of  a  broad  class  of  Bayesian  model  building  techniques. 
MML  also  includes  constructs  for  specifying  qualitative  dependence  and  independence 
constraints. 

4.  A  hierarchical,  semi-parametric  Bayesian  “meta-model”  that  automatically  builds 
ensembles  of  generalized  mixture  models  from  database  tables.  These  ensembles  serve  as 
baseline  data  generators  that  BQL  can  use  for  data  cleaning,  initial  exploration,  and  other 
routine  applications. 

This  design  insulates  end  users  from  most  statistical  considerations.  Queries  are  posed  in  a 
qualitative  probabilistic  programming  language  for  Bayesian  data  analysis  that  hides  the  details 
of  probabilistic  modeling  and  inference.  Baseline  models  can  be  built  automatically  and 
customized  by  statisticians  when  necessary.  All  models  can  be  critically  assessed  and 
qualitatively  validated  via  predictive  checks  that  compare  synthetic  rows  (generated  via  BQL’s 
SIMULATE  operation)  with  rows  from  the  original  data.  Instead  of  hypothesis  testing, 
dependencies  between  variables  are  obtained  via  Bayesian  model  selection. 

BayesDB  is  “Bayesian”  in  two  ways: 

1.  In  BQL,  the  objects  of  inference  are  rows,  and  the  underlying  probability  model  forms 
a  “prior”  probability  distribution  on  the  fields  of  these  rows.  This  is  then  constrained  by 
row-specific  observations  to  create  a  posterior  distribution  of  field  values.  Without  this 
prior,  it  would  be  impossible  to  simulate  rows  or  infer  missing  values  from  partial 
observations. 

2.  In  MML,  the  default  meta-model  is  Bayesian  in  that  it  assigns  a  prior  probability  to  a 
very  broad  class  of  probabilistic  models  and  narrows  down  on  probable  models  via 
Bayesian  inference.  This  prior  is  unusual  in  that  it  encodes  a  state  of  ignorance  rather 
than  a  strong  inductive  constraint.  MML  also  provides  instructions  for  augmenting  this 
prior  to  incorporate  qualitative  and  quantitative  domain  knowledge. 

In  practice,  it  is  useful  to  use  BQL  for  Bayesian  queries  against  models  built  using  non-Bayesian 
or  only  partially  Bayesian  techniques.  For  example,  MML  supports  composing  the  default  meta¬ 
model  with  modeling  techniques  specified  in  external  code  that  need  not  be  Bayesian.  However, 
the  default  is  to  be  Bayesian  for  both  model  building  and  query  interpretation,  as  this  ensures  the 
broadest  applicability  of  the  results. 

2.4  Picture 

The  idea  of  computer  vision  as  the  Bayesian  inverse  problem  to  computer  graphics  has  a  long 
history  and  an  appealing  elegance,  but  it  has  proved  difficult  to  directly  implement.  Instead,  most 
vision  tasks  are  approached  via  complex  bottom-up  processing  pipelines.  Our  research  has 
shown  that  it  is  possible  to  write  short,  simple  probabilistic  graphics  programs  that  define 
flexible  generative  models  and  to  automatically  invert  them  to  interpret  real-world  images.  It  has 
also  shown  that  custom  inference  strategies  can  improve  performance  on  solving  these  problems. 
If  scaled  up,  this  approach  has  the  potential  to  address  a  broad  class  of  visual  scene 
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understanding  problems  faced  by  the  Department  of  Defense,  and  could  yield  solutions  with  both 
reduced  code  complexity,  improved  accuracy,  and  other  functional  benefits  such  as  the  ability  to 
report  uncertainty  in  scene  percepts  for  use  by  downstream  processing. 

Picture  is  a  domain  specific  probabilistic  programming  language  for  this  class  of  problems.  The 
approach  implemented  by  Picture  is  called  generative  probabilistic  graphics  programming. 
Generative  probabilistic  graphics  programs  consist  of  a  stochastic  scene  generator,  a  renderer 
based  on  graphics  software,  a  stochastic  likelihood  model  linking  the  renderer’ s  output  and  the 
data,  and  latent  variables  that  adjust  the  fidelity  of  the  renderer  and  the  tolerance  of  the 
likelihood  model.  Representations  and  algorithms  from  computer  graphics,  originally  designed  to 
produce  high-quality  images,  are  instead  used  as  the  deterministic  backbone  for  highly 
approximate  and  stochastic  generative  models.  This  formulation  combines  probabilistic 
programming,  computer  graphics,  and  approximate  Bayesian  computation,  and  depends  only  on 
general-purpose,  automatic  inference  techniques. 

Example  applications  of  this  approach  include:  (i)  reading  sequences  of  degraded  and 
adversarially  obscured  alphanumeric  characters;  (ii)  inferring  3D  road  models  from  vehicle- 
mounted  camera  images;  (iii)  inferring  3D  models  of  faces  from  single  images,  and  rendering 
those  models  in  different  viewpoints  and  lighting  conditions;  (iv)  inferring  models  of  3D  human 
bodies  from  single  images,  including  challenging  images  from  outdoor  scenes  and  bodies  that  are 
partially  occluded;  and  (v)  inferring  3D  models  of  radially  symmetric  objects  such  as  cups  and 
lamps  and  bottles  from  household  images.  Each  of  the  probabilistic  graphics  programs  we 
present  relies  on  under  50  lines  of  probabilistic  code,  and  supports  accurate,  approximately 
Bayesian  inferences  about  ambiguous  real-world  images. 
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Figure  17.  Computer  vision  (bottom  path)  as  the  inverse  problem  to  computer  graphics  (top  path). 
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Inverse  graphics 


Bottom-up  computer  vision 


Method_ Acainicy 


Alyeta!  68.31% 

GPGP  (Best  Single  Appearance)  64.36% 

GPGP  (Maximum  Likelihood  over  Multiple  Appearances)  74.60% 

Figure  18.  An  illustration  of  generative  probabilistic  graphics  for  3D  road  finding.  One  the  left,  results  from  Aly  et  al  (2006).  On 
the  right,  typical  inference  results  from  the  proposed  generative  probabilistic  graphics  approach.  (Kulkarni  et  al.,  2015  [2]) 
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Figure  19.  Four  input  images  from  our  CAPTCHA  corpus,  along  with  the  final  results  and  convergence  trajectory  of  typical 
inference  runs.  The  first  row  is  a  highly  cluttered  synthetic  CAPTCHA  exhibiting  extreme  letter  overlap.  The  second  row  is  a 
CAPTCHA  from  TurboTax,  the  third  row  is  a  CAPTCHA  from  AOL,  and  the  fourth  row  shows  an  example  where  our  system 
makes  errors  on  some  runs.  Our  probabilistic  graphics  program  did  not  originally  support  rotation,  which  was  needed  for  the 
AOL  CAPTCHAs;  adding  it  required  only  1  additional  line  of  probabilistic  code.  (Mansinghka  et  al.,  2013) 


2.5  MetaProb 

Probabilistic  programming  is  based  on  two  ideas:  (i)  representing  probabilistic  models  as 
programs  and  (ii)  representing  inference  and  learning  operations  as  meta-programs.  Most 
existing  probabilistic  languages  provide  a  small  set  of  built-in  meta-programs;  adding  new 
inference  and  learning  techniques  requires  extending  the  language  implementation.  Metaprob  is  a 
probabilistic  meta-programming  language  that  aims  to  remove  this  limitation.  In  Metaprob, 
partial  execution  traces  of  probabilistic  programs  are  first-class  objects.  The  Metaprob  interpreter 
can  execute  probabilistic  programs  against  partial  traces,  treating  traces  as  a  source  of  Pearl-style 
interventions  on  the  behavior  of  a  probabilistic  program.  Metaprob  (meta-)  programs  can  also 
trace  the  behavior  of  other  probabilistic  programs,  modify  these  traces,  and  use  interventions 
based  on  these  modified  traces  to  implement  Monte  Carlo  inference  strategies.  This  research  has 
shown  how  to  use  Metaprob’ s  language  constructs  to  implement  two  distinctive  features  of  the 
Church  probabilistic  programming  language — namely,  stochastic  memorization  and  Metropolis- 
Hastings  transitions  on  execution  traces — as  ordinary  user-space  meta-programs. 
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From  a  practical  standpoint,  Metaprob  is  an  unusually  suitable  vehicle  research  and  teaching  of 
probabilistic  programming  and  meta-programming,  much  like  Lisp  was  a  suitable  vehicle  for 
research  and  teaching  in  both  programming  techniques  and  in  language  design,  interpretation, 
and  compilation.  Metaprob  also  represents  our  best  attempt  to  solve  a  fundamental  problem 
posed  by  Dr.  Kathleen  Fisher  in  the  early  stages  of  the  PPAML  program:  is  there  an  intermediate 
representation  that  could  potentially  be  shared  among  multiple  probabilistic  programming 
languages,  and  a  simple  interface  that  allows  different  "solvers"  for  inference  problems  to 
interact  and  cooperate? 

This  report  gives  key  examples  of  Metaprob  programs  that  highlight  the  principles  of 
probabilistic  programming  that  we  discovered  and  developed  over  the  course  of  the  PPAML 
program. 
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3.0  METHODS,  ASSUMPTIONS,  AND  PROCEDURES 


For  all  the  research  we  performed  in  this  program,  we  adopted  an  experimental  approach  driven 
by  the  evaluation  scenarios.  Our  overarching  goal  was  to  formalize  and  simplify  the  development 
of  systems  that  perform  probabilistic  modeling  and  inference.  The  metrics  used  to  evaluate  our 
progress  include  the  lines  of  code  required  to  specify  problems  and  the  runtime  and  accuracy  of 
the  resulting  solutions. 

For  our  research  papers  and  program  presentations,  we  chose  to  evaluate  our  developed  systems 
on  both  internally  developed  benchmark  problems,  on  benchmark  problems  chosen  by  and  on 
DARPA  challenge  problems  developed  by  the  TAl  Galois  team.  Our  internal  experiments  and 
our  interactions  with  DARPA  evaluations  led  us  to  pursue  an  approach  involving  multiple 
probabilistic  language  implementations  —  Venture,  BayesDB,  Picture,  and  Metaprob  —  rather 
than  a  single  monolithic  system. 

Regular  engineering  meetings  and  research  collaboration  meetings  were  carried  out  between  the 
teams  working  together  on  this  program.  Additional  feedback  was  sought  by  meetings  with 
academic  and  corporate  partners,  presentations  at  academic  conferences,  and  academic  courses. 

During  the  course  of  the  program,  we  devoted  a  major  effort  to  the  packaging,  testing,  and 
usability  of  the  system.  Our  releases  included  both  unit  and  system  tests  (comprising  our 
regression  test).  We  also  devoted  major  effort  to  documentation  included  with  system 
deliverables.  Finally,  a  major  effort  was  devoted  to  open-sourcing  and  releasing  to  the  public  our 
analyses,  models,  and  tools. 

All  of  the  probabilistic  programming  languages  that  we  developed  run  on  open-source  software 
and  standard  hardware;  no  proprietary  software  or  hardware  is  required. 
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4.0  RESULTS  AND  DISCUSSION 


This  section  presents  results  from  a  broad  class  of  probabilistic  programs  in  multiple  languages 
that  were  developed  over  the  course  of  the  program.  It  includes  results  on  program  challenge 
problems  developed  by  DARPA,  as  well  as  results  on  problems  that  were  developed  internal  to 
our  team  for  testing  and  benchmarking  purposes. 

1.  Results  for  solving  inference  problems  using  custom  inference  strategies  written  in 
Venture.  These  include  quantitative  results  for  a  set  of  benchmark  problems,  some  of 
which  were  inspired  by  the  PPAML  "Small  Problems"  benchmark  set  developed  by 
Galois  as  part  of  the  challenge  problems. 

2.  Example  probabilistic  programs  and  meta-programs  written  in  Metaprob.  Metaprob 
is  a  probabilistic  meta-programming  language  that  was  initially  developed  as  a  set  of 
extensions  to  Venture. 

3.  Example  results  drawn  from  program  challenge  problems,  specifically  (i)  the  January 
2017  Hackathon  on  the  Gapminder  database  and  (ii)  Challenge  Problem  7:  Flu 
forecasting. 

4.  Example  results  for  our  team  challenge  problem  of  analyzing  a  database  of  Earth 
satellites  using  BayesDB.  This  section  also  includes  background  material  on  BayesDB, 
including  a  brief  overview  of  its  architecture  and  pointers  to  key  technical  innovations.  It 
also  illustrates  example  capabilities  that  are  analogous  to  those  evaluated  by  DARPA 
during  the  PPAML  Hackathons. 

5.  Example  results  showing  how  probabilistic  programming  can  be  used  to 
reimplement  the  "Automatic  Statistician"  in  under  70  lines  of  code  in  Venture  . 

These  results  show  that  probabilistic  meta-programs  written  in  Venture  can  solve  hard 
model  discovery  problems. 

6.  Example  results  showing  how  probabilistic  programming  can  be  used  to  solve 
problems  of  visual  scene  understanding.  These  emphasize  results  from  Picture,  a 
domain-specific  probabilistic  programming  language  we  developed  for  this  type  of 
problem,  motivated  in  part  by  DARPA's  interest  in  the  domain. 


4.1  Results  for  solving  inference  problems  using  custom  inference  strategies  written  in 
Venture^. 

Today,  practitioners  typically  solve  probabilistic  modeling  and  inference  problems  by 
developing  custom  inference  strategies  adapted  to  each  problem.  This  is  true  in  a  broad  class  of 


^  This  section  is  largely  taken  from  Mansinghka  et  al.  (2018,  in  review) 
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fields  (Gelman  et  al.,  2014;  Hahnel  et  al.,  2003;  Liu,  2008;  Murphy,  2012;  Russell  and  Norvig, 
2003;  Thrum,  Burgard,  and  Fox,  2005). 


Many  inference  strategies  work  by  breaking  inference  problems  down  into  subproblems  and 
repeatedly  applying  appropriate  probabilistic  inference  tactics  (algorithms)  to  each  subproblem. 
The  tactics  and  subproblems  are  chosen  so  that  the  results  converge  to  a  good  answer  to  the 
overall  inference  problem.  These  strategies  are  now  a  standard  approach  and  have  been  shown  to 
deliver  significant  accuracy  and  performance  benefits  in  this  context  (Andrieu  et  al.,  2003; 
Huang,  Tristan,  and  Morrisett,  2017;  Lindsten,  Jordan,  and  Schoen,  2014;  Murray,  2013;  Neal, 
2003;  Ranganath,  Gerrish,  and  Blei,  2014;  Wingate  and  Weber,  2013). 

Current  probabilistic  programming  languages,  however,  typically  provide  only  a  specific  black¬ 
box  inference  strategy  or  set  of  strategies  (Carpenter  et  al.,  2016;  Goodman  et  al.,  2008; 
Goodman  &  Stuhlmueller,  2014;  Gordon  et  al.,  2014;  Gordon  et  al.,  2014b;  Uber  Al  Labs,  2017; 
Milch  et  al.,  2007;  Tolpin,  van  de  Meent,  and  Wood,  2015;  Tristat  et  al.,  2014).  This  situation 
denies  developers  the  substantial  performance  and  accuracy  benefits  that  custom  inference 
strategies  can  provide.  To  the  best  of  our  knowledge,  the  only  exception  is  the  Venture  language 
(Mansinghka,  Selsam,  and  Perov,  2014). 

Custom  Inference  Strategies:  In  this  segment  of  the  report  we  present  new  probabilistic 
programming  language  constructs  that  enable  developers  to  specify  custom  inference  strategies. 
These  constructs  allow  developers  to  solve  probabilistic  inference  problems  by  identifying 
subproblems  and  applying  specified  probabilistic  inference  tactics  to  these  subproblems. 
Subproblems  are  specified  by  identifying  a  subset  of  target  stochastic  choices  made  by  a 
probabilistic  program  and  subset  of  absorbing  stochastic  choices  that  insulate  the  remaining 
computation  from  the  effects  of  any  changes  in  the  target  stochastic  choices.  Together,  these 
target  choices  and  absorbing  choices  dene  the  boundaries  of  the  subproblem.  Inference  tactics 
work  within  these  boundaries,  leaving  the  computation  outside  the  boundaries  unaffected.  Once 
the  subproblem  is  identified,  the  developer  can  then  specify  an  inference  tactic  to  use  on  that 
subproblem.  Together,  these  constructs  provide  a  exible  and  expressive  mechanism  for 
specifying  custom  inference  strategies. 

We  have  implemented  these  constructs  in  the  context  of  the  Venture  probabilistic  programming 
language  and  used  them  to  compare  the  performance  of  multiple  inference  strategies  for  solving 
a  set  of  benchmark  problems.  Our  experimental  results  indicate  that  no  single  inference  strategy 
is  optimal  for  all  problems.  These  results  highlight  how  custom  inference  strategies  can  improve 
inference  programming  and  accuracy  in  probabilistic  programming  languages. 

We  also  formalize  the  use  of  subproblem  selection  and  inference  tactics  for  probabilistic 
programs.  This  formalism  provides  a  framework  for  integrating  difference  inference 
algorithms/solvers  into  a  unified  probabilistic  programming  environment.  The  solvers  than 
becomes  composable  building  blocks  that  developers  can  use  to  implement  custom  inference 
strategies  that  solve  the  probabilistic  inference  problem  at  hand. 
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This  research  yielded  two  contributions: 


1.  Custom  inference  strategies.  It  showcases  probabilistic  programming  constructs  that 
enable  developers  working  in  probabilistic  programming  languages  to  specific  custom 
inference  strategies.  These  custom  strategies  enable  developers  to  identify  subproblems 
and  apply  specified  inference  tactics  to  each  subproblem.  These  constructs  make  it 
possible,  for  the  first  time,  for  developers  using  probabilistic  programming  languages  to 
specify  custom  inference  strategies  for  solving  probabilistic  modeling  and  inference 
problems. 

2.  Experimental  Results.  We  have  implemented  a  set  of  benchmark  probabilistic 
computations  in  Venture.  Experimental  results  from  these  benchmark  computations 
illustrated  the  performance  and  accuracy  benefits  that  custom  inference  strategies  can 
provide  in  this  context. 
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Figure  20.  Gaussian  Process  Structure  Learning  using  an  inference  strategy  that  applies  Metropolis-Hastings  to  subproblems. 
Note  the  accuracy  of  the  forecast  (green)  outside  of  the  regime  in  which  data  was  observed  (black  xs). 
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Figure  21.  Gaussian  Process  Structure  Learning  with  a  different  inference  strategy.  This  inference  result  shows  the  need  for 
custom  inference  strategies.  The  global  Metropolis-Hastings  strategy  shown  here  captures  the  linear  trend  of  the  data  but  does  not 
find  the  finer- grained  structure  in  the  time  series. 
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assuflw  tre«_root  -  ()  ->  {1}; 

assiiffltt  got_hyper_prior  mem  ((node,  index)  ->  { 

“log.logist ic (log_odds_uni f onn( )  thypers : node.index) 

}); 

assume  choose. primitive  •  (node)  ->  { 

base.kernel  categorical(simplex( .2,  .2,  .2,  .2,  .2), 
-C-.  -LIN-.  "SE-,  “PER"]) 
«structure:pair("ba8e_Kernal“ ,  node) ; 

cond( 

(base.kernel  ■*  'WN") 

((  "rt^i",  got_hyper_prior(pair('*WN‘* ,  node))]), 
(base.kernel  -•  "C") 

(C'C',  get. hyper .prior (pair ("C” ,  node))]), 
(base.kernel  --  "LIN") 

((’LIN",  get.hyper.prior(pair('*LIN" ,  node))]), 
(base.kernel  —  "SE") 

(("SE",  .01  get.hyper.prior(pair(  "SE"  ,  node))]), 
(base.kernel  ■■  "PER”) 

([•PER  ", 

.01  ♦  get  .hyper. prior  (pair  (  "PER  _  1 ,  node)), 

.01  +  get.hyper_prior(pair("PER_t  ,  node)) 

])) 

}; 

assume  choose. operator  ■  mem ((node)  ->  i 

operator.symbol  categorical  (simplex (0.45,  0.45,  0.1), 
('♦",  "CP"])  t3tructure:pair("operator'" ,  node); 

if  (operator.symbol  ■■  "CP")  { 

[operator.symbol,  get.hyper.prior(pair("CP  ,  node))] 

>  else  { 
operator.symbol 

> 

»; 

assume  generate.random.program  ■  mem((node)  ->  { 
if  (flip(.3)  fstructure :pair("branch” ,  node))  { 
operator  choose.operator(node) ; 

[ 

operator, 

generate. random.program(2  *  node) , 
generate. random.program(2  •  node  ♦  1) 

] 

}  else  { 

cboo8e.primitive(node) 

> 

}); 


(a)  Synthesis  model;  AST  prior 


assume  produce. covariance  -  (source)  ->  { 
cond( 

(8ource[0]  •«  "-N")  (gp.cov.scale(80urce[l] ,  gp.cov.bump)) , 
(source[0]  »»  "C")  (gp.cov.const(source[l] )) , 

(source[0]  ■■  "LIN')  (gp.cov.linear (source [l] )) , 

(8ource[0]  “  "SE")  (gp_cov.8e(source[l]»»2)) , 

(source [0]  «»  "PER") 

(gp.cov.periodic (sourced] ♦•2,  source [2]))  , 

(source  [0]  —  . .  ( 

gp.cov.sum( 

produce.covariance(so\irce  [1]  ) , 
produce.covarlance (source [2]))) , 

(source [0]  **  "•")  ( 
gp.cov.product ( 

produce.covarlance (source [1] ) , 
produce.covarlance (soiurce [2])))  , 

(source [0] [0]  ••  "CP")  ( 
gp.cov.cp( 

source [0] [1] ,  .1, 
produce.covarlance (source [1] ) , 
produce.covarlance (source [2] ) ) ) ) 

>; 

assume  produce.executable  ■  (source)  ->  { 

baseline. noise  •  gp.cov.scale( . 1 ,  gp.cov.bump); 
covariance.kernel  •  gp.cov.sum( 
produce.covarlance (source) , 
baseline.noise) ; 

make. gp(gp.mean. cons t (0. ) ,  covariance.kernel) 

>: 


(b)  Synthesis  model:  AST  interpreter 


Figure  22.  Venture  model  code  for  the  Gaussian  process  structure  learning  program.  This  probabilistic  model  code  defines  a 
probabilistic  model  over  structured  Gaussian  process  models  for  time  series  data. 
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assume  source  ~  generate_raiidom_program(tree_root() ) ; 
assume  gp_executable  =  produce_executable (source) ; 
define  xs  =  get_data_xs(" ./data.csv") ; 
define  ys  =  get _data_ys(" ./data. csv") ; 
observe  gp.executable (${xs})  =  ys; 

(a)  Data  observation  program 

f or_each(arangeCT) ,  (_)  ->  { 
infer 

resimulation.mhC 

minimal_subproblem(random_singleton/?hypers) ) ; 
infer 

res imul at i on_mh ( 

minimal_subproblem(random_singleton(/?structure) ) ) 

}); 

(b)  Synthesis  inference  strategy:  Subproblein-based,  single¬ 
site  MH 

f or.eachCarangeCT) ,  (_)  ->  { 

infer  resimul at ion.mh (minimal. subproblem(/*)) ; 

}); 

(c)  Synthesis  inference  strategy:  Subproblem-based,  single¬ 
site  MH 

Figure  23.  Venture  observation  and  inference  code  for  the  Gaussian  process  structure  learning  program.  This  code  shows 
different  custom  inference  strategies  for  discovering  Gaussian  process  model  structures. 
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(p)  (a 


> 


Figure  24.  A  graphical  model  representing  the  variables  and  dependencies  for  our  second  benchmark  problem:  a  stochastic 
volatility  model. 


Auto-correlation 


Auto-correlation 


(a) 


(b) 


Figure  25.  Stochastic  Volatility  Model  with  two  types  of  observation  sequences  and  three  inference  operators:  single-site 
Metropolis-Hasting  operating  on  individual  random  choices,  global  Metropolis-Hastings  for  all  random  choices,  and  particle 
Gibbs. 
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assume  precision  =  1  /  0. 05**2; 
assume  sigma  =  l/sqrt(precision) ; 
assume  phi  =  0.99; 

//  sigmaO  is  fixed, 
assume  h  =  mem((t)  ->  { 
if(t  <=  0)  { 

nonnaKO  ,  sigmaO)  #h:t 
}  else  { 

nonnaKphi  *  h(t-l)  *  sigma  ,  sigma)  #h:t 

} 

}> 

assume  x  =  (t)  ->  {normaKO,  h(t)/2)}; 

//  Observe  data, 
f or_each(arangeCT) , 

Ct)  ->  { 

observe  x(t)  =  X[t]  //X[t]  is  the  observation  at  time  t 

}) 

//  state  estimation  via  Particle  Gibbs  with  P  particles  for  N 
//  iterations 
f or_each(arangeCN) , 

(.)  ->  { 

infer  pgibbsCminimal_subproblem(ordered(/?h)) ,  P) 

}>; 

//  state  estimation  via  MH  applied  to  single  states  every 
//  iteration  for  N  *  T  *  P  iterations 
f or.eachCarangeCN  *  T  *  P) , 

C.)  ->  { 

infer 

resimulation_mh(minimal_subproblem(random_singleton(/?h) ) ) 

}); 

//  state  estimation  via  MH  applied  to  all  states  every  iteration 
//  for  N  *  P  iterations 
f or_each(arange(N  *  T  *  P) , 

C.)  ->  { 

infer  resimulation_mh(minimal_subproblem(/?h) ) 

}); 


Figure  26.  Shows  Venture  code  for  the  stochastic  volatility  model,  along  with  code  for  custom  inference. 
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Figure  27.  A  third  benchmark  problem:  logistic  regression  for  classifying  handwritten  digits  from  the  MNIST  database. 


assume  w  multivariate_normal (mu,  Sig)  #w; 

assume  y_x  =  (x)  ->  {bernoulliClinear_logisticCw,  x))}; 

//  Observe  data. 
for_each(arange (size (data) ) , 

(n)  ->  { 

//  Features  X[n]  =  [xO_n,  xl_n,  ...],  class  Y[n] 
observe  y_xCX[n])  =  Y[n] 

}) 

//  Do  T  iterations  of  MH  with  a  prior  proposal; 
f or_each(arangeCT) , 

(.)  ->  { 

infer  resimulation_mh(minimal_subproblem(/*) ) 

}); 

//  or  do  T  iterations  of  MH  with  a  random  walk  proposal  of 
//  bandwidth  sigma, 
f or_each(arange(T) , 

(.)  ->  { 
infer 

mh_with_gaussian_drift .proposal Cminimal_subproblem(/?w) ) 

}); 

Figure  28.  Venture  code  for  the  Bayesian  logistic  regression  example  from  the  previous  figure 
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//  Fixed:  dimensionality  D,  hypers  mu_w,  sig_w,  mO,  kO,  vO,  SO. 

assume  alpha  ~  gamma ( 1 ,  1)  #alpha; 

assume  crp  =  make_crp (alpha) ; 

assume  z  =  mem((i)  ->  {crpO  #z:i}); 

assume  w  -  mem((z)  ->  { 

multivariate_normalCmu_w,  sig_w)  #w:z) 

}; 

assume  c  =  mem((z))  ->  { 

make_collapsed_multivariate_normalCmO ,  kO,  vO,  SO) 

}; 

assume  x  =  (i)  ->  {c(z(i))}; 

assume  y  =  (i,  x)  ->  {bernoulli(linear_logisticCw(zCi)) ,  x))}; 
//  Observe  data. 
for_each(arange (size (data) ) , 

(n)  ->  { 

observe  x(n)  =  X[n] ;  //  X[n]  is  n-th  feature  vector, 
observe  y(n)  =  Y[n]  //  Y[n]  is  n-th  class  label. 

}); 

//  T  steps  of  MH  for  hyperparams,  single-site  (collapsed)  gibbs 
//  for  z,  MH  over  weights  for  a  randomly  chosen  cluster 
define  mh_with_gibbs  =()->{ 

resimulat ion_mh (minimal_subproblem(/? alpha) ) ; 
gibbs(minimal_subproblem(random_singleton(/?z))) ; 
mh_with_gaussian_drift_proposal (minimal_subproblem(/?w) ) 

}; 

f or_each(arange (T) , 

(_)  ->  {infer  mh_with_gibbs () } 

); 

//  or  use  the  default  inference  operator  for  about  the  same 
//  amount  of  computations  (two  input  clusters  are  inferred 
//  in  the  example), 
f or_each(arange(T  ♦(l+N+2)), 

(_)  ->  { 
infer 

res imulation_mh(minimal_sub problem (r andom_singlet on (/*) ) 

) 

}); 


Figure  29.  Venture  code  for  a  more  complex  Dirichlet  process  mixture  of  experts  model,  extending  the  previous  logistic 
regression  benchmark. 
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Figure  30.  Accuracy  results  for  the  Dirichlet  process  mixture  of  experts,  (a)  training  data,  (b)  prediction  on  test  data  with  s  =  0.3 
after  2  hours.  6  clusters  are  found  with  decision  boundaries  (dashed  lines)  and  mis-classified  points  (black  dots),  (c)  Predictions 
accuracy  vs  running  time  in  log  domain. 
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//  Model  component, 
assume  disease.prior  ■  0.01; 

assume  disease. 1  ^  flipCdisease .prior)  tsymptomArO; 
assume  disease.2  ~  flip(disease .prior)  #symptom> : 1 ; 
assume  disease.3  ^  flipCdisease .prior)  #symptomA:2  #8ymptom6:0; 
assume  disease-4  ^  flipCdisease .prior)  #symptomB:l; 
assume  disease.5  flipCdisease  .prior)  tsyaptomB  :2; 

assume  disease-6  ^  flipCdisease .prior)  fsymptomB:3; 
assume  get. causal. link  ■  Cparent)  ->  { 
if  Cpsurent)  { 

0.01  //  Probability  of  a  leak 
}  else  { 

1. 

} 

>; 

assume  get. effect  -  Cparents,  effect)  ->  { 
if  CsizeCparents)  0)  { 
effect 
}  else  { 
got.effectC 

rest Cparents) , 

effect  *  get.causal.linkCfirst Cparents)) 

) 

> 

>; 

assume  noisy.or  »  Cparents)  { 
p. spontaneous  «  0.001; 

flipC 

1  -  CCl  -  p. spontaneous)  *  get.ef feet Cto.list Cparents) ,  1)) 

) 

li _ 


//  Observation  component, 
observe 

noisy.or C [disease. 1 ,  disease-2,  disease-3])  «  True; 
observe 

noisy.or C [disease-3,  disease-4,  disease-5,  disease.6])  ■  True 
//  Inference  component:  global  MH. 
f or. eachCarangeCn.inference. steps) , 

C.)  ->  { 

infer 

resimulation.mhCminimal.subproblemC/*) ) 

»: 

It  Inference  component  SSMH 
for.eachCarangeCn.inference. steps) , 

C.)  ->  < 
infer 

res imulat ion.mh  Cminimal.subproblem  Crandom.singlet onC/ • ) ) 

»; 

//  Inference  component:  single-site  Gibbs  sampling, 
for .each Carange Cn.inference. steps) , 

C_)  ->  { 

infer 

gibbsCminimal.subproblemCrandom.singletonC/*))) 

»; 

//  Inference  component:  Gibbs  sampling  --  block  proposals, 
for .eachCarangeCn.inference. steps) , 

CJ  ->  < 
infer 

gibbs Cminimal.subproblem C/?8ymptomA)) ; 
gibbs  Cmin imal .subproblem  C /Tsympt  omB) ) 

»; 


Figure  31.  Venture  model  component,  observation  component,  and  inference  components  for  a  bipartite  Bayesian  network 
inference  problem. 
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Error  (averaged  over  diseases) 
for  different  inference  strategies 

resimulation_mh 
single_site_mh 
singlesite^gibbs 
block_gibbs 

0.000.050.100.15  0.200.250.30 

Error 


Figure  32.  A  comparison  of  the  accuracy  of  custom  inference  strategies  on  the  bipartite  Bayesian  network  Inference  accuracy  is 
calculated  as  the  average  difference  in  probabilities  of  each  state  obtained  from  a  single  Markov  chain  run  for  100  samples  and 
the  ground  truth  probabilities. 


Here  we  have  presented  language  constructs  for  specifying  custom  probabilistic  inference 
strategies  in  probabilistic  programming  languages.  These  constructs  enable,  for  the  first  time, 
developers  using  probabilistic  programming  languages  to  identify  subproblem  and  apply 
specified  inference  tactics  to  the  identified  subproblems.  Results  from  benchmark  programs 
implemented  in  the  Venture  probabilistic  programming  language  highlight  the  significant 
accuracy  and  efficiency  benefits  that  the  use  of  customized  inference  strategies  can  deliver  in  this 
context. 

4.2  Example  probabilistic  programs  and  meta-programs  written  in  Metaprob. 

This  section  gives  example  probabilistic  programs  and  meta  programs  written  in  MetaProb;  most 
of  the  following  segment  of  the  report  is  taken  from  Radul  and  Mansinghka  (2018,  expected). 

Probabilistic  programming  is  based  on  two  ideas:  (i)  probabilistic  models  can  be  represented  as 
executable  computer  programs  and  (ii)  inference  and  learning  algorithms  can  be  represented  as 
meta-programs  that  take  these  programs  as  inputs  and/or  produce  them  as  outputs.  These  ideas 
are  the  basis  of  a  rapidly  growing  collection  of  languages  and  implementations.  Each  language 
provides  some  “built-in”  probability  distributions,  plus  language  constructs  for  combining  them 
to  larger  programs  that  represent  meaningful  probabilistic  models.  Each  language  also  provides  a 
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small  set  of  “built-in”  inference  algorithms  that  can  infer  likely  outcomes  from  these  probability 
distributions  given  data. 

Unfortunately,  existing  languages  do  not  provide  language  constructs  for  adding  new  probability 
distributions  or  inference  algorithms.  If  the  built-in  set  is  inadequate  for  some  problem,  the 
probabilistic  programmer  needs  to  either  choose  a  different  language  or  learn  how  to  extend  the 
language  implementation.  Most  existing  languages  also  do  not  provide  constructs  for  combining 
built-in  general-purpose  algorithms  with  user-specified  algorithms  that  are  customized  for  a 
given  application.  These  are  fundamental  limitations.  In  practice,  they  restrict  the  use  of 
probabilistic  programming  to  applications  where  a  small  set  of  probability  distributions  and 
generic  inference  algorithms  are  known  to  be  adequate  ahead  of  time.  They  also  greatly  limit  the 
extent  to  which  probabilistic  programmers  can  take  advantage  of  advances  in  algorithms  for 
approximate  inference. 

Metaprob  is  a  simple,  extensible  language  for  probabilistic  programming  and  meta¬ 
programming.  Metaprob  is  simple  enough  that  an  implementation  of  the  language  fits  on  a  single 
page,  requiring  under  100  lines  of  Metaprob  code.  Also,  in  Metaprob,  new  probability 
distributions  and  inference  algorithms  can  written  as  ordinary  user-space  programs.  Metaprob  is 
extensible  and  expressive  enough  that  distinctive  features  of  other  probabilistic  languages,  such 
as  Metropolis-Hastings  inference,  sequential  Monte  Carlo  inference,  and  modeling  with  Dirichlet 
processes,  can  each  be  written  in  under  one  page  of  Metaprob  code.  Inference  in  Metaprob  is  not 
limited  to  Monte  Carlo  sampling. 

It  is  helpful  to  view  MetaProb  in  terms  of  an  analogy  to  Lisp.  Lisp  is  well  known  both  for  its 
unusual  simplicity  and  for  its  support  for  meta-programming.  Lisp  source  code  follows  a  small 
set  of  rules,  and  is  easy  to  represent  using  lists,  the  central  data  structure  in  the  language.  The 
process  of  Lisp  program  execution  also  follows  a  small  set  of  rules  that  closely  mirror  the 
structure  of  Lisp  source  code.  Thus  Lisp  interpreters  can  be  written  as  short  Lisp  programs. 
Executable  Lisp  programs  can  also  be  represented  in  the  language,  as  “procedures”  that  take  a 
specified  list  of  formal  parameters.  These  procedures  are  ordinary  data  Lisp  objects  that  can  be 
produced  and  executed  by  other  Lisp  programs.  This  combination  of  simplicity  and 
expressiveness  helped  make  Lisp  a  widely  used  language  for  research  in  artificial  intelligence 
and  in  programming  language  design  and  implementation.  Lisp  also  provides  a  simple  model  of 
computation  that  has  been  useful  for  teaching. 

The  design  of  Metaprob  aims  to  preserve  the  simplicity  and  expressiveness  of  Lisp,  while  adding 
the  minimal  additional  features  needed  for  research  in  probabilistic  artificial  intelligence  and  in 
probabilistic  programming.  Metaprob  also  aims  to  serve  as  a  simple  model  of  probabilistic 
computation  that  can  be  used  to  teach  the  key  concepts  to  a  broad  audience.  This  model  of 
computation  differs  from  the  Lisp  model  in  three  fundamental  ways: 

1 .  The  built-in  Metaprob  interpreter  allows  the  execution  of  a  Metaprob  program  to  be 
intervened  on,  i.e.,  forced  to  take  on  specific  values  at  chosen  points  in  a  program’s  execution. 
Metaprob  provides  language  constructs  for  naming  points  in  a  program’s  execution  and  a  data 
structure  called  a  trace  for  storing  mappings  between  execution  points  and  values.  Metaprob’ s 
syntax  is  designed  to  make  it  easy  to  use  these  capabilities. 
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2.  Probabilistic  programs  in  Metaprob  constitute  extensible  packages  of  executable  code,  source 
eode,  meta-data,  and  meta-programs.  These  packages  can  be  re-opened  by  other  Metaprob 
(meta)programs,  for  example  to  retrieve  the  source  code  of  a  program  or  to  retrieve  a  program 
that  speeifies  the  output  probability  distribution  of  some  other  program. 

3.  Typical  Metaprob  programs  rely  on  a  standard  library*  of  “inference  meta-programs”,  not  just 
a  built-in  interpreter  or  eompiler.  These  inferenee  meta-programs  take  a  probabilistic  program 
and  also  a  “target”  trace  as  input,  and  generate  executions  of  the  program  that  are  compatible 
with  the  “target”  traee. 

These  differences  constitute  the  core  contributions  of  Metaprob’ s  design.  Here,  we  introduce  the 
language  features  that  underlie  these  differenees  and  shows  that  they  are  suffieient  to  implement 
a  broad  class  of  techniques  for  probabilistic  modeling  and  inference.  Applications  are  drawn 
from  hierarehieal  Bayesian  modeling,  Bayesian  networks,  and  non-parametric  Bayesian 
statistics.  To  the  best  of  our  knowledge,  Metaprob  is  the  first  probabilistic  language  with  an 
explicit  meta-circular  implementation,  and  the  only  probabilistie  language  in  whieh  the 
distinctive  modeling  and  inference  features  of  expressive  languages  such  as  Church  can  be 
implemented  as  short  (meta)programs.  We  also  believe  that  Metaprob  is  the  only  probabilistie 
language  with  support  for  interventions  and  for  adding  new  probability  distributions  within  the 
language. 

The  remainder  of  this  section  gives  a  tutorial  introduction  to  Metaprob.  It  is  also  helpful  to 
navigate  this  seetion  in  light  of  the  main  Metaprob  example  programs  that  it  presents.  The  full 
set  of  example  programs  we  have  developed  for  Metaprob  so  far  include  the  following: 

1 .  Inferring  if  a  coin  is  tricky  or  fair  via  a  Monte  Carlo  inference  in  a  hierarehieal  Bayesian 
model. 

2.  Comparing  inference  given  observations  to  inference  given  interventions  on  a  discrete- 
variable  Bayesian  network,  using  both  approximate  and  exact  inference  algorithms. 

3.  Implementing  an  interpreter  for  Metaprob  that  supports  overriding  the  execution  of  a 
program  via  interventions. 

4.  Implementing  a  traeing  interpreter  for  Metaprob  that  reeords  the  stoehastic  choices  made 
by  a  probabilistic  program  during  its  execution. 

5.  Adding  the  Gaussian  probability  distribution,  packaging  code  for  a  sampler  with  code  for 
the  log  output  probability  density  of  the  Gaussian. 

6.  Performing  inference  via  rejeetion  sampling  and  sampling-importanee-resampling,  i.e. 
via  sampling,  weighting,  and  resampling  traces  based  on  their  compatibility  with  a  trace 
representing  the  eonstraints  on  the  exeeution  of  a  program. 

7.  Implementing  an  approximate  inference  meta-program  that  changes  a  single  random 
ehoiee  at  the  time.  This  program  re-implements  the  single-site  Metropolis-Hastings 
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inference  algorithm  that  is  the  basis  of  probabilistic  programming  languages  such  as 
Church  and  WebPPL. 


8.  Implementing  an  inference  algorithm  that  enumerates  all  possible  executions  of 
probabilistic  programs  with  finite  support  and  deterministic  control  flow. 

9.  Adding  the  Chinese  restaurant  process,  and  adding  a  memoizer  to  the  language,  both  of 
which  use  traces  to  store  their  internal  state. 

10.  Putting  these  pieces  together  to  build  a  Chinese  restaurant  process  mixture  model  and 
using  it  to  infer  clusters  from  numerical  data.  This  example  was  a  capstone  example  in 
the  original  paper  on  the  Church  probabilistic  programming  language  that  helped  to 
crystallize  growth  of  the  probabilistic  programming  field. 
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//  (A)  Define  a  function  whose  traces  will  serve  as  the  model 
flip_coins  =  (n)  ~>  { 
root_addr  =  &this; 
tricky  =  f lip  (0 . 1 ) ; 

weight  =  if  (tricky)  {uniform(0,  1);}  else  {0.5;}; 
map((i)  ~>  with_address  /$root_addr /datum/$i/ :  f lip(weight) , 
range(n)) ; 

}; 


//  (B)  Run  it  once 
a.trace  =  {{  }}; 
trace_choices  ( 
program 
inputs 

intervention_trace 
outpu  t_t  race 


flip_coins  , 
[5], 

{{  }}, 
a_trace ) ; 


//  (C)  Have  a  look  at  (a  sample  of)  whether  the  coin  is  tricky 
print (*a_trace [/I /tricky /flip/]); 

False 

1 1  (D)  Force  the  coin  to  come  up  heads  every  time 

a_trace [/datum/0/f lip/]  :=  true; 

a_trace [/datum/1 /flip/]  :=  true; 

a_trace [/datum/2/f lip/]  :=  true; 

a_trace [/datum/3/f lip/]  :=  true; 

a.trace [/datum/4/f lip/]  :=  true; 


//  (E)  Define  a  Metropolis -based  inference  strategy 
approx imate_inference_update  =  ()  ~>  { 
single_site_metropolis_hastings_step( 
program  =  flip_coins, 
inputs  =  [5] , 
trace  =  a_trace  , 

constrai nt_addresses  =  set_difference ( 
addresses_of(a_ trace), 

[/I /tricky /flip/,  /2/weight/then/0/uniform/])); 


//  (F)  Run  it  a  bit 

repeat(times  =  20,  program  =  approximate_inference_update) ; 

//  (G)  Look  at  (a  sample  of)  the  (inferred)  trickiness 
print (*a_trace [/I /tricky/flip/]); 

False 


Figure  33.  A  Metaprob  transcript,  modeling  n  flips  of  a  potentially  biased  coin. 


In  this  example,  the  prior  is  that  the  coin  has  an  0. 1  probability  of  being  biased,  in  which  case  it 
has  an  unknown  uniformly  distributed  weight;  otherwise  is  fair.  Each  output  is  a  sample  from  the 
distribution  on  values  of  the  tricky  variable  at  that  point;  they  will  vary  with  the  initial  entropy 
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given  to  the  pseudo-random  number  generator.  (Radul  and  Mansinghka,  2018). 


Figure  34.  Two  traces  of  executing  the  flip_coins  program  from  Figure  1  for  two  flips. 

The  example  in  the  top  pane  takes  the  “tricky  coin”  control  flow  path;  note  the  coin  weight 
drawn  from  the  Beta  distribution.  The  example  in  the  bottom  pane  takes  the  “fair  coin”  control 
flow  path.  The  coin  weight  (0.5)  is  not  explicitly  recorded  there  because  it  is  deterministic  in  the 
source  code.  The  large  node  on  the  left  is  the  root  of  the  trace.  We  can  name  any  node  (or 
subtrace)  by  listing  the  edge  labels  that  lead  to  it  from  the  root.  This  is  an  address,  which  we 
write  slash-bounded,  by  analogy  with  URLs.  For  example,  the  Boolean  indicating  whether  the 
coin  is  tricky  lives  at  the  address  /1/tricky/flip/  (Radul  and  Mansinghka,  2018). 
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earthquake_bayesian_network  =  ()  -> 
earthquake  =  flip(0.1); 
burglary  =  flip(0.1); 
p.alartn  = 

if  (burglary  &&  earthquake)  0.9 
else  if  (burglary)  0.85 
else  if  (earthquake)  0.2 
else  0.05; 


alarm 

p_ j  ohn_call 
john_call 
p_mary_call 
mary_call 
’’  ok  ”  ; 


flip(p_alarm) ; 
if  (alarm)  0.8  else 
flip(p_john_call)  ; 
if  (alarm)  0.9  else 
flip(p_mary_call)  ; 


}; 


{ 


0.1  ; 

0.4; 


query  =  (state) 
earthquake  = 
burglary  = 

alarm  = 

john_call  = 
mary_call  = 
(earthquake , 


->  { 

*state[/0/ earthquake/flip/]; 
*state[/1/burglary/flip/] ; 
*state[/3/alarm/flip/] ; 
*state[/5/john_call/flip/]; 
*state[/7/mary_call/flip/]; 
burglary,  alarm,  john_call  ,  mary_call) 


Figure  35.  Definition  of  the  classic  earthquake-burglary  Bayes  net  in  Metaprob,  including  a  function  for  collecting  the  variables 
of  interest  from  a  trace  of  its  execution.  (Radul  and  Mansinghka,  2018) 
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alarm_went_of f  =  {{  }}; 
alarm_went_of  f  [/3/alarin/f  lip/]  :=  true; 

inter_trace  =  ()  ^>  { 
t  =  {{  »; 
trace.choices ( 
program 
inputs 

intervention_trace 
output_t race 
t; 

}; 

inter_traces  =  replicate( 

times  =  n_samples ,  program  =  inter_t race) ; 

discrete_histogram( 

traces  =  inter_traces , 

label.func  =  query , 

title  =  ” Alarm_intervened_(samples) ") ; 


=  earthquake_bayesian_network  , 
=  []. 

=  alarm.wen t_of f  , 

=  t); 


Figure  36.  The  joint  distribution  induced  by  intervening  on  the  earthquake-burglary  Bayes  net  to  set  the  alarm  variable  to  true. 

The  top  row  shows  code  for  sampling  from  the  distribution  on  traces  of  this  Bayes  net  (top  left) 
and  a  graphical  representation  of  the  network  induced  by  the  intervention  (top  right).  The  bottom 
row  shows,  from  left  to  right,  the  empirical  distribution  of  sampling  from  it  and  the  exact 
distribution  computed  by  enumerating  over  possible  states. 
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Metaprob  syntax 

Semantic  effect 

Result 

trace  =  {{  »; 

Literal  empty  trace 

o 

trace[/a/]  :=  3; 

Assign  3  at  the  key  a 

o*© 

trace[/a/b/]  :=  4; 

Assign  4  at  the 
two- key  address  /a/b/ 

OKIH-O 

subtrace  =  trace[/a/]; 

Obtain  the  subtrace 
at  the  key  a 

*subtrace; 

Fetch  the  value  at  the 
root  node  (of 
subtrace) 

3 

trace[/a/b/]  :=  7; 

Repeated  assignments 
overwrite 

subtrace; 

The  subtrace  is  an 
alias,  so  sees  the 
modification 

© 

0 

trace[/c/d/]  :=  8; 

Traces  have  unlimited 
fanout,  and  nodes 
may  lack  values 

O00 

trace  has_value; 

Postfix  operator 
checking  for  a  value  at 
the  root  node 

False 

addresses_of (trace) ; 

List  of  addresses  that 
have  values 

/a/,  /a/b/,  /c/d/ 

{{  1,  z  =  2,  a  = 
**subtrace,  */c/d/  = 
8}}; 

Trace  literal  syntax 
can  include  a  root 
value,  assginment  by 
key,  and  splicing  of 
subtraces  (with  **)  or 
addresses  (with  *) 

0^^— 0 

Figure  37.  Basic  operations  on  Metaprob  traces,  showing  the  Metaprob  cod  and  results. 
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single_site_metropolis_hastings_step  = 

(program,  inputs,  trace,  constraint_addresses)  ~> 

{ 

//  Choose  a  target  site  to  propose  uniformly 
choice_addresses  =  addresses_of ( trace ) ; 
candidates  =  set_difference (choice_addresses , 

constraint.addresses); 

target_address  =  uniform_sample (candidates) ; 


//  Compute  a  proposa 
ini  t ial_value  = 

initial_num_choices 
del  trace[target_add 
new_trace  =  {{  }}; 
(_,  forward_score)  = 


new_value 


1  trace 

*trace[target_address]; 
=  length (candidates)  ; 
ress ] ; 


propose, 
progra 
inputs 
interv 
target 
output 
*  new_t ra 


and_trace_choices( 


m 

ent ion.trace 
.trace 
.trace 
ce[ target. address]; 


program  , 
inputs  , 

{{  }}, 
trace  , 
new.t race 


); 


//  The  proposal  is  to  move  from  trace  to  new. trace; 

//  Now  compute  the  acceptance  ratio. 

new. choice. addresses  =  addresses. of (new. trace) ; 

new.candidates  =  set. difference(new. choice. addresses , 

constraint.addresses); 
new. num. choices  =  length ( new.candidates ) ; 
restoring. trace  =  {{* $target. address  =  ini  tial. value }} ; 
for.each(  set.difference (choice. addresses , 

new. choice. addresses ) , 

(addr)  ->  {  restoring. trace [addr]  :=  *  trace [addr ] ;  }); 
del  new. trace [target. address] ; 

(_,  reverse. score)  =  propose( 

program  =  program  , 

inputs  =  inputs  , 

intervent  ion. trace  =  restoring.t race  , 

target. trace  =  new. trace); 

new.trace [target. address ]  :=  new.value; 
log. p. accept  =  forward. score  -  reverse. score 

-  log( initial. num. choices)  +  log ( new.num.choices ) ; 


if  (log(uniform(0 ,  1))  <  log. p. accept )  {  //  Accept 
for.each(  set.difference (choice. addresses , 

new.choice.addresses ) , 

(addr)  ->  {  del  trace[addr];  }) ; 
for.each(  new.choice.addresses, 

(addr)  ->  {  trace[addr]  :=  *new_trace [ addr ] ;  }) ; 
}  else  {  //  Reject 

trace [target. address]  :=  ini tial.value ; 

}; 

}; 


Figure  39.  A  user-space  meta-program  for  performing  a  single  step  of  “lightweight”  Metropolis-Hastings  inference.  The  program 
uses  standard  meta-programs  for  proposing  program  traces  subject  to  constraints  to  create  a  proposal  trace  and  calculate  the 
acceptance  ratio.  (Radul  and  Mansinghka,  2018) 
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mcmc.sample  =()'>{ 
result  =  {{  }}; 

constraints  =  addresses.of (target  )  ; 
trace.choices  ( 

program  =  normal_normal  , 

inputs  =  args , 

intervent ion_trace  =  target, 

output_trace  =  result); 

repeat(times  =  s,  program  =  ()  ~>  { 
single_site_metropolis_hastings_step( 
program  =  normal_normal , 
inputs  =  args  , 
trace  =  result , 

constraint_addresses  =  constraints); 

»: 

*result[/0/x/gaussian/]; 

binned_histogram( 

samples  =  replicate (times  =  20,  program  =  mcmc_sample ) , 
overlay.densities  = 

{{  prior  =  prior_density ,  posterior  =  analytic.density  }}, 
title  =  "MCMC„("  +  string(s)  +  "„resimulation„steps) ”) ; 


X 


Figure  40.  Markov-chain  Monte  Carlo  inference  over  the  user-space  Gaussian  distribution.  The  top  panel  is  a  transcript  of 
invoking  MCMC  (with  a  prior  resimulation  proposal),  and  the  bottom  panel  shows  histograms  of  the  samples  thus  obtained. 
(Radul  and  Mansinghka,  2018) 
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generate_from_dirichlet_process_mixture  = 
(num.datapoints)  ~>  { 

root.addr  =  &this; 


alpha 

gamma  (1.0,  1.0); 

sample_assignment 

= 

make_chinese_restaurant_sampler (alpha) 

get_cluster 

mem(  (i)  ~>  {  sample_ass ignment () ;  }  ) 

a 

- 

inverse_gamma  (3 . 0 ,  10.0); 

b 

- 

uniform  (0 . 0 ,  10.0); 

mu 

-- 

uniform(-1 50. 0 ,  150.0); 

V 

inverse_gamma (2 ,  100); 

get_mu 

= 

mem(  (cluster)  ~>  { 

normal (mu ,  V  x  get_sigma (cluster)) ; 
}  ); 


ge 

t_s 

igma 

= 

mem  ( 

(  cli 

Lister)  ~> 

sqr 

t  (  i  nverse 

_gamma 

(a , 

b)); 

} 

): 

ge 

t_d 

atapoint 

= 

mem  ( 

(i) 

~>  { 

clu 

ster  =  get.clus 

ter  ( 

i) ; 

wit 

h_address 

/$roo 

t_addr  /d. 

atum/ $i / : 

normal(  get_ 

mu ( clu 

ster 

). 

get_ 

sigma  ( 

clus 

ter) 

); 

} 

); 

ma 

P(g 

et_datapo 

int  ,  r 

ange 

(num. 

_datapoin 

}; 


Figure  41.  Metaprob  source  code  for  a  program  that  generates  data  from  a  Dirichlet  process  mixture  of  Gaussians  with  randomly 
chosen  hyper-parameters.  This  program  uses  user-space  implementations  of  make_chinese_restaurant_sampler  and  mem.  It  also 
uses  with_address  to  ensure  that  the  data  generated  by  this  sampler  is  easy  to  find,  in  traces  of  this  program. 
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target_trace  =  {{  }}; 

data  =  [100,  101,  -99,  -100,  99,  -101,  100.5,  99.5,  -100.5,  -99.5] 

for_each ( range ( length ( data )) ,  (i)  ->  { 

ta rget_t race [/ datum /$i /normal /]  :=  *data[/$i/]; 

»; 

data_addresses  =  add resses_of ( target_t race ) ; 
markov_chain_state  =  {{  }}; 
trace_choices  ( 

program  =  generate_f rom_di richlet_process_mixture , 

inputs  =  [ length (data)] , 

//  ensures  the  initial  state  has  the  data 
intervention_trace  =  target_trace  , 

output_trace  =  markov_chain_state 

); 

approximate_inf erence_update  =  ()  ~>  { 
single_site_metropolis_hastings_step( 

program  =  generate_f rom_di richlet_process_mixture , 
inputs  =  [ length ( data )] , 
trace  =  markov_chain_state  , 
constraint.addresses  =  data_add resses 

): 

}; 

repeat  ( 

times=1 000 , 

program  =  approximate_inf erence_update 

): 

all_data  = 
interpret ( 

program  =  generate_f rom_di richlet_process_mixture , 
inputs  =  [ length ( data )  +  100],  //  100  predictive  samples 
intervent ion_trace  =  mar kov_cha in_state  ); 
predictive_data  =  al l_data [ length (data ): end ] ; 


Figure  42.  Metaprob  source  code  for  a  meta-program  that  adds  an  observed  dataset,  does  1000  steps  of  lightweight  Metropolis- 
Hastings  inference  on  executions  of  the  program  from  Figure  33,  and  then  generates  100  samples  from  the  predictive  distribution. 
This  meta-  program  traces  the  program  to  obtain  an  initial  state,  runs  1000  transitions,  then  re-executes  the  original  program 
treating  this  final  trace  as  source  of  interventions,  to  generate  samples  from  the  posterior  predictive.  (Radul  and  Mansinghka, 
2018) 
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Trace  plots  of  three  relevant  features  of  the  Markov  chain,  for  3  random  runs 


M-H  step 


Figure  43.  Progress  of  inference  in  the  Dirichlet  process  mixture  of  Gaussians  model  with  hyperparameter  inference  from  Figure 
42,  above.  The  top  panel  shows  the  evolution  of  the  log  probability  (density)  of  the  current  state.  Note  that  the  logarithm  is  itself 
displayed  on  a  logarithmic  scale.  The  center  panel  shows  the  Shannon  entropy  of  the  cluster  assignment  in  bits:  0  corresponds  to 
all  points  in  one  cluster,  1  to  the  points  split  evenly  among  two  clusters,  etc.  The  bottom  panel  shows  the  values  taken  by  the  V 
hyperparameter,  which  controls  the  cluster  spread  to  cluster  width  ratio.  (Radul  and  Mansinghka,  2018) 
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4.3  Example  results  drawn  from  program  challenge  problems 

This  section  of  the  report  summarizes  representative  results  on  (i)  the  January  2017  Hackathon 
results  on  the  Gapminder  dataset  and  (ii)  Challenge  Problem  7:  Flu  forecasting.  Additional 
program  challenge  problems  and  team  challenge  problems  are  discussed  elsewhere  in  this  report. 

4.3.1  January  2017  Hackathon  -  Gapminder  Dataset 

The  goal  of  the  January  2017  Hackathon  was  to  demonstrate  the  capabilities  of  probabilistic 
programming  on  a  variety  on  four  data  analysis  tasks:  (i)  imputation,  (ii)  modeling  the  effect  of 
intervention  variables  on  outcome  variables,  (iii)  modeling  missing  data  patterns  using  CrossCat 
and  BayesDB,  and  (iv)  model  criticism  and  refinement.  The  original  dataset  contained  roughly 
80  parallel  time  series,  with  many  missing  values. 

Task  1:  Imputation  of  missing  values 

Create  a  joint  model  of  the  data  and  then  sample  from  the  joint  posterior  over  the  missing 
values. 

For  this  task,  probabilistic  programming  allowed  us  to  quickly  try  different  modeling 
assumptions,  as  well  as  different  assumptions  about  which  parts  of  the  data  may  be  relevant  for 
imputing  any  given  missing  value.  Our  team  was  able  to  show  results  for  three  approaches:  (i) 
Bayesian  time  series  models  fit  independently  for  each  variable  for  each  country;  (ii) 

Independent  CrossCat  models  for  each  year;  and  (iii)  CrossCat  models  over  sets  of  adjacent 
years,  which  can  discover  which  cross-sectional  or  temporal  dependencies  are  relevant  for 
prediction. 

Task  2:  Modeling  the  Effect  of  Intervention  Variables  on  Outcome  Variables 

Fit  models  of  the  outcome  variables  as  a  function  of  intervention  variables  (and  potential 
confounding  variables  as  appropriate)  to  generate  causal  hypotheses  about  the  effects  of 
intervention  variables  on  outcome  variables. 

On  this  task,  our  team  was  able  to  both  (i)  detect  the  existence  of  predictive  relationships,  as  well 
as  (ii)  quantify  their  probable  strength.  We  also  hypothesized  the  existence  of  latent  variables 
derived  from  the  model  structures  and  then  tested  whether  they  could  be  used  to  decouple  the 
relationship  between  a  given  intervention  and  outcome  variable. 

Task  3:  Modeling  Missing  Data  Patterns  using  CrossCat  and  BayesDB 

Determine  cases  where  the  missing  values  are  not  missing  at  random. 

For  this  task,  data  of  170  countries  and  51  variables  (socioeconomic  indicators)  over  51  years 
was  provided.  Some  entries  were  missing.  The  purpose  of  this  task  is  to  determine  whether  the 
missing  values  can  safely  be  treated  as  random,  or  whether  there  are  values  missing  not  at 
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random  (MNAR).  If  there  were  values  MNAR,  we  were  asked  to  fit  a  model  conditioned  on  the 
observation  of  "missingness." 

By  building  a  CrossCat  generative  probabilistic  model  of  "missingness"  in  6  cross-sections  of 
time  between  1960  to  2010,  we  were  able  to  expose  clusters  of  variables  and  clusters  of 
countries,  to  reveal  the  rich  set  of  relationships  that  explain  "missingness".  Specifically,  we 
identified  three  apparent  causal  processes  on  which  the  missing  data  patterns  depend  in  a  cross- 
section  of  time:  (1)  geographic/economic  properties  of  the  countries,  (2)  processes  related  to  data 
collection,  and  (3)  processes  related  to  violent  conflict  /  civil  unrest. 


1 .  Geographic  and  economic  patterns.  The  first  cluster  of  missing  variables  is  a 
cluster  in  which  some  —  approximately  1/5  to  3/5  —  of  values  are  missing.  We 
found  that  this  pattern  of  missingness  depends  on  the  geographic  and  economic 
category  of  the  country.  For  example.  Western  Democracies  are  more  likely  to 
have  similar  missing  values  to  one  another  than  to  developing  countries  in 
Africa.  For  African  countries,  the  converse  is  true. 

2.  Data  collection  patterns.  The  second  cluster  of  variables  in  2000  —  with 
similar  clusters  observed  across  years  —  is  one  that  contains  variables  sanitation 
and  water  source.  The  analysis  revealed  that  for  almost  all  countries,  a  country 
has  both  of  these  missing  or  neither  missing.  More  rarely  does  a  country  have 
one  missing  but  not  the  other.  The  countries  included  do  not  have  an  obvious 
economic  or  geographic  pattern  —  instead,  there  appears  to  be  some  technicality 
of  data  collection  or  analysis  that  accounts  for  this  pattern  of  missingness.  The 
interpretation  is  that  these  two  variables  are  collected  by  the  same  agency  or 
using  a  similar  technology.  Corroborating  what  BayesDB  found,  a  web  search 
shows  that  these  two  variables  are  collected  and  analyzed  together. 

3.  Conflict  zones.  The  third  cluster  of  variables  are  those  variables  that,  in  the  vast 
majority  of  countries,  are  "not  missing."  However,  there  is  a  small  number  of 
countries  that  are  notable  exceptions.  Those  countries  appear  to  be  high  conflict 
zones  around  the  year  2000  or  countries  that  have  had  governmental  transitions. 
For  example,  this  cluster  includes  Afghanistan,  Iraq,  South  Sudan,  Serbia,  and 
Montenegro.  In  these  conflict  zones,  more  variables  are  missing  than  normal 
including  some  that  are  almost  always  collected. 
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Table  1:  Clusters  of  variables,  including  raw  values  and  missingness  indicators,  detected  by 
BayesDB.  BayesDB  detected  several  groups  of  variables  that  were  known  to  the  evaluation 
team,  and  found  additional  relationships  as  well. 

Task  4:  Model  Criticism  and  Refinement 

Develop  a  refinement  of  the  model  (or  completely  replace  it)  to  obtain  a  better  model. 

For  this  task,  BayesDB  was  used  to  (i)  study  the  characteristics  and  issues  of  the  baseline  log- 
linear  model  for  maternal_mortality  in  the  3  datasets  and  (ii)  develop  and  characterize  an 
improved  probabilistic  model  for  maternal_mortality,  as  well  as  compare  its  performance  to 
baseline  log-linear  regression. 
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4.3.2  Challenge  Problem  #7  -  Flu  Problem  2017^ 


This  challenge  problem  focused  on  predicting  seasonal  rates  of  influenza-like  illness  (ILI)  in  60 
sub-populations  in  the  US,  ranging  in  size  from  the  individual  counties  to  the  entire  country. 

Data  included  ILI  rate  data  for  each  population,  as  well  as  three  different  kinds  of  covariates  data 
(flu-related  tweets,  vaccination  claims,  and  weather). 

This  problem  is  an  instance  of  a  general  problem  class:  jointly  modeling  hundreds  of  time  series 
with  widely  varying,  non-stationary  dynamics.  Multivariate  time  series  data  is  ubiquitous,  arising 
in  domains  such  as  macroeconomics,  neuroscience,  and  public  health.  Unfortunately, 
forecasting,  imputation,  and  clustering  problems  can  be  difficult  to  solve  when  there  are  tens  or 
hundreds  of  time  series.  One  challenge  in  these  settings  is  that  the  data  may  reflect  underlying 
processes  with  widely  varying,  non-stationary  dynamics.  Another  challenge  is  that  standard 
approaches  based  on  auto-regressive  models,  such  as  AR(p),  ARMA,  ARIMA,  and  GARCH 
models,  can  become  statistically  unstable  and  computationally  intensive.  Auto-regressive  models 
also  require  users  to  know  how  to  choose  from  a  large  set  of  possible  parameter  settings  and 
appropriately  transform  the  data.  Other  technical  issues  that  arise  include  non-linear 
dependencies,  missing  data,  the  presence  of  outliers,  redundant  variables,  and  sparse 
dependencies 

Our  probabilistic  programming  research  under  PPAML  enabled  us  to  develop  and  apply  a 
domain-general  nonparametric  Bayesian  model  for  multivariate  time  series  which  aims  to 
address  some  of  the  above  challenges.  We  implemented  Markov  chain  Monte  Carlo  algorithms 
for  forecasting,  imputation,  and  clustering  based  on  this  model.  The  modeling  approach  is  based 
on  two  extensions  to  Dirichlet  process  mixture  models.  First,  we  introduce  an  extension  to  the 
Chinese  restaurant  process  that  captures  temporal  dependencies  in  the  generated  data.  Second, 
we  introduce  a  hierarchical  extension  that  allows  us  to  cluster  time  series  into  groups  whose 
underlying  dynamics  are  modeled  and  forecast  jointly.  We  used  the  approach  to  forecast  flu 
incidence  rates  in  10  US  regions,  based  on  data  from  the  US  Center  for  Disease  Control  and 
Prevention.  Our  approach  uses  weather  and  social  media  time  series  as  covariates. 

Quantitative  experiments  showed  that  the  proposed  approach  outperforms  several  Bayesian  and 
nonBayesian  baselines,  including  Facebook’s  Prophet  algorithm,  Gaussian  process  models,  and 
seasonal  ARIMA  models.  We  also  show  competitive  imputation  performance  with  state-of-the- 
art  time  series  imputation  techniques  such  as  Amelia  II.  Two  sets  of  predictions  are  shown 
below,  for  the  state  of  Tennessee  and  the  state  of  Texas. 


^  This  section  is  based  on  Saad  and  Mansinghka,  2017. 
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ILI  Rate  in  TN 
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Figure  44.  Influenza-like  illness  rates  in  Tennessee,  including  both  raw  data  and  forecasts  from  the  probabilistic  programs 
developed  for  this  challenge  problem. 
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Figure  45.  Influenza-like  illness  rates  in  Texas,  including  both  raw  data  and  forecasted  rates  from  the  probabilistic  program 
developed  to  solve  this  challenge  problem. 


Our  team  solved  this  problem  by  applying  a  series  of  different  Bayesian  model  discovery 
techniques  using  BayesDB.  This  included  the  standard  Bayesian  model  discovery  methods,  as 
well  as  novel  methods  developed  during  our  PPAML  no-cost  extension.  In  these  models,  the 
dependencies  between  time  series  data  sources  are  mediated  by  shared  cluster  assignments, 
which  ensure  that  all  the  time  series  share  the  same  segmentation  of  the  time  course  into  various 
temporal  regimes.  For  example,  in  one  posterior  sample,  xix  temporal  regimes  describing  the 
seasonal  behavior  shared  among  x_flu  ,  x_tweet ,  and  x_temp  are  shown:  purple,  gray,  and  red 
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are  the  pre-peak  rise,  peak,  and  post-peak  decline  during  the  flu  season;  yellow,  brown,  and 
green  represent  the  rebound  in  between  successive  seasons.  In  2012,  the  model  reports  absence 
of  the  red  post-peak  regime,  reflecting  the  season’s  mild  flu  peak. 


1.  Sample  concentration  parameter  of  CRP 

a  ^  Gamma(l  ,l  ) 

2.  Sample  mot  lei  hyperparametens  {?7,  =  l^  2j . . , ,  iV) 

3.  Sample  distribution  parameters  of  F  (n  =  Ij  2, . . . , 

4.  Assume  first  p  values  are  known  (n  =  1,  2j .  * . ,  jY) 

x2p+i:0~{*2p+i,...,4) 

5.  Sample  time  series  observations  (t  =  1,2..,.) 

5.1  Sample  temporal  cluster  assignment  Zi 

Pr[zt  =  k  I 

a  CRP{fc|a,Zi,t_,)n;Li 

where  :=  |  1  <  <  t} 

and  k  =  L  . ,  .  ,  max  H-  1 

5.2  Sample  data  (p  =  1,2,...,  N) 


Incidence  ol  Influenza- Like  llness  {%) 


Minimjm  Temperature  (F) 


(b)  Discovering  Hu  season  dynamics  with  the  TCCRP 


(a)  Generative  process  for  the  multi  vaiiate  TCCRP  mix  tine 

Figure  46.  (a)  Hierarchical  Bayesian  model  for  the  joint  distribution  of  N  dependent  time  series  {xn},  using  a  multivariate 
temporally-coupled  CRP  mixture,  (b)  example  latent  structure  for  data  from  the  Flu  challenge  problem,  showing  how  the 
discovered  time  series  models  break  down  thein  put  data  into  clusters.  (Saad  and  Mansinghka,  2017). 
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(a)  Foreca^sting  the  2015  flu  season  in  US  Region  (i.  In  these  two  representative  weeks  (top:  week  2014.51,  bottom:  week 
2015.10),  the  multivariate  TCCRP  mixture  accurately  forecasts  both  the  pre-  and  post-peak  seasonal  dynamics  (right-most 
column),  whereas  state-of-the  art  baselines  demonstrate  inaccurate  forecasts  and/or  miscalibrated  uncertainties. 
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(b)  Mean  absolute  flu  prediction  error  for  ten  forecast  horizons  (in  weeks)  averaged  over  10  populations 

h  =  I  h  =  2  /r  =  3  h  =  4  /?  =  5  /?  =  6  h  =  7  h  =  S  h  =  9  h  =  10 

Constant  O  ’^3(o.oi)  0*b0(o.o3)  0*b6(o.o3)  0-71  (o.03)  0-75(o.o2)  0-7^9(0.02)  0.82(o.o2)  0.85(o.o2)  0-87(o.o2)  0.89(o.o2) 

Linear  Extrapolation  0.65(o.o6)  0.79(o.o5)  0.93(o.o5)  L08(o.o5)  L^d^o.os)  L39(o.o5)  l*35(o.o5)  l*70(o.o5)  L86(o.o5)  2*01(o.o5) 

GF  (SE+FER+WN)  0.53(o.o4)  0.60(o.o3)  0-66(o.o3)  0.71(o.o3)  0-75(o.o2)  0.79(o.o2)  0.82(o.o2)  0-85(o.o2)  0-87(o.o2)  0.89(o.o2) 

GF  (SEx FER+W N)  0.50(o.o4)  0.57(o.o3)  0.62(o.o3)  0.67(o.o2)  0-71(o.o2)  0.74(o.o2)  0.78(o.o2)  0.81(o.o2)  0-34(o.o2)  0. 86(0.02) 

Facebook  Prophet  [31]  0.83(o.o4)  0.84(o.o3)  0.85(o.o2)  0.85(o,o2)  0.85(o.o2)  0. 86(0.02)  0.86(o.o2)  0.87(o.o2)  0.87(o.oi)  0.87(o.oi) 

Seasonal  ARIMA  [13]  0.64(o.o4)  0.76(o.o3)  0.84(o.o3)  0.92(o.o3)  0.98(o.o3)  L04(o.o2)  108(o.o2)  113(o.o2)  L16(o.o2)  l-19(o.o2) 

Univariate  TCCRP  0.54(o.o4)  0.58(o.o3)  0.62(o.o2)  0.67(o.o2)  0.71(o.o2)  0.76(o.o2)  0.80(o.o2)  0.83(o.o2)  0. 86(0. 02)  0.89(o.o2) 

Multivariate  TCCRP  0.46(o.o3)  0-49(o.o2)  0-51(o.o2)  0-53(o.o2)  0  56(o.o2)  0-58(o.o2)  0-59(o.oi)  0  61(o.oi)  0-f>2(o.oi)  0-f>4(o.oi) 


Figure  47.  Quantitative  evaluation  of  forecasting  performance  using  several  simple  and  state-of-the-art  baselines  (a)  Examples  of 
pre  and  post-the  peak  forecasts  in  US  Region  6.  (b)  Mean  prediction  errors  and  one  standard  error  for  short,  medium,  and  long¬ 
term  forecast  horizons  averaged  over  Regions  1  through  10.  Predictive  improvement  of  the  multivariate  TCCRP  mixture  over 
baseline  methods  is  especially  apparent  at  higher  forecast  horizons  (Saad  &  Mansinghka,  2017). 


4.4  Example  results  for  our  team  challenge  problem  of  analyzing  a  database  of  Earth 
satellites  using  BayesDB. 

One  of  the  team  challenge  problems  over  the  course  of  the  PPAML  program  was  focused  on 
providing  computer-aided  judgment  for  ‘medium  data’  problems,  and  whether  it  was  possible  to 
make  statistical  inference  broadly  accessible  to  non-statisticians  without  sacrificing  mathematical 
rigor  or  inference  quality.  We  tested  these  capabilities  by  applying  BayesDB  to  a  public  database 
of  Earth  satellites  and  measuring  the  runtime  and  accuracy  of  a  broad  class  of  queries. 

This  section  outlines  a  case  study  applying  compositional  generative  population  models 
(CGPMs)  in  BayesDB  to  a  population  of  satellites  maintained  by  the  Union  of  Concerned 
Scientists.  The  dataset  contains  1163  entries,  and  each  satellite  has  23  numerical  and  categorical 
features  such  as  its  material,  functional,  physical,  orbital  and  economic  characteristics.  We 
construct  a  hybrid  CGPM  using  an  MML  metamodel  definition  which  combines  (i)  a  classical 
physics  model  written  as  a  probabilistic  program  in  Ventures cript,  (ii)  a  random  forest  to  classify 
a  nominal  variable,  (iii)  an  ordinary  least  squares  regressor  to  predict  a  numerical  variable,  and 
(iv)  principal  component  analysis  on  the  real -valued  features  of  the  satellites.  These  CGPMs 
allow  us  to  identify  satellites  that  probably  violate  their  orbital  mechanics,  accurately  infer 
missing  values  of  anticipated  lifetime,  and  visualize  the  dataset  by  projecting  the  satellite 
features  into  two  dimensions. 
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Data  for  Compass  M4 


0 

Name 

Compass  M4  (Beidou  2-13) 

Country  of  Operator 

China  (PR) 

Operator  Owner 

Chinese  Defense  Ministry 

Users 

Military 

Purpose 

Navigation/Global  Positioning 

Class  of  Orbit 

MEO 

Type  of  Orbit 

NaN 

Perigee  km 

21452 

Apogee  km 

21603 

Eccentricity 

0.00271 

Period  minutes 

773.21 

Launch  Mass  kg 

2200 

Dry  Mass  kg 

NaN 

Power  watts 

NaN 

Date  of  Launch 

41027 

Anticipated  Lifetime 

6 

Contractor 

Space  Technology  Research  Institute  (part  of  C... 

Country  of  Contractor 

China  (PR) 

Launch  Site 

Xichang  Satellite  Launch  Center 

Launch  Vehicle 

Long  March  3B 

Source  Used  for  Orbital  Data 

ZARYA 

longitude  radians  of  geo 

NaN 

Inclination  radians 

0.961676 

Columns:  descriptive  variables 


Red  represents  missing  data 


Rows: 

Satellites 


Figure  48.  Example  application:  Earth-orbiting  satellites.  The  right  panel  shows  a  schematic  of  the  dataset,  released  publicly  by 
the  Union  of  Concerned  Scientists.  The  left  panel  shows  a  single  example  record  corresponding  to  a  Chinese  Defense  Ministry 
satellite. 


%mml  CREATE  TABLE  satellites 
FROM  satellites. CSV; 


%mml  CREATE  POPULATION  SCHEMA  FOR  satellites 
FROM  satellites_data 
WITH  SCHEMA  ( 

GUESS  STATTYPES  FOR  (*); 

...  ); 


%mml  INITIALIZE  64  POPULATION  MODELS  FOR 

satellites  USING  MODELING  TACTIC  crosscat; 

%mml  IMPROVE  MODELS  FOR  satellites  FOR  4  MINUTES; 

Figure  49.  Example  probabilistic  code  written  in  BayesDB  for  Bayesian  model  discovery  from  this  satellites  database. 
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CREATE  TABLE  satellites_ucs  FROM  ' satellites . csv" 

.nullify  satellites_ucs  'NaN^ 

CREATE  POPULATION  satellites  FOR  satellites_UCS  WITH  SCHEMA  ( 


IGNORE  Name; 

GUESS  STATTYPES  FOR  *: 


); 

CREATE 

MODEL  longitude_radians_of_geo,  inclination_radians  AS  CYCLIC; 

Customize  statistical  types 

POPULATION  SCHEMA  sat hybrid  FOR  satellites  WITH  BASELINE  crosscat( 

OVERRIDE  GENERATIVE  MODEL 

FOR  launch_mass_kg,  dry_mass_kg,  power_watts,  perigee_km,  apogee_km 
EXPOSE  pci  NUMERICAL,  pc2  NUMERICAL 

USING  factor analysis(L=2); 

Factor  analysis  with  exposed 
factors 

OVERRIDE  GENERATIVE  MODEL 

FOR  anticipated_lifetime 

GIVEN  date_of_launch,  power_watts,  apogee_km,  perigee_km,  dry_mass_kg,  class_of_orbit 
USING  linear regression; 

Linear  regression 
to  predict  lifetime 

); 

OVERRIDE  GENERATIVE  MODEL  FOR  type_of_orbit 

GIVEN  apogee_km,  perigee_km,  period_minutes,  users,  class_of_orbit 
USING  random_forest(k=7); 

Random  forests  to  classify 
orbit  types 

INITIALIZE  16  POPULATION  MODELS  FOR  satellites; 

IMPROVE  MODELS  FOR  4  MINUTES; 

Figure  50.  Example  probabilistic  code  written  in  BayesDB  for  improving  over  baseline  Bayesian  model  discovery  by  integrating 
custom  models.  This  demonstrates  a  unique  capability  of  BayesDB:  combining  baseline  probabilistic  programs  that  were 
synthesized  by  an  automatic  mechanism  with  custom  probabilistic  code  written  by  an  end  user. 


"What  are  the  satellites  in  geosynchronous  orbit  with  the  most  unlikely  orbital  periods?" 


ESTIMATE  name,  class_of_orbit ,  period_minutes ,  PROBABILITY  OF  period_minutes 

FROM  satellites 

WHERE  class_of_orbit  =  GEO 

ORDER  BY  PROBABILITY  OF  period_minutes  ASCENDING  LIMIT  10 


Name 

Class_of_Orbit 

Period_minutes 

Relative  Probability  of  Period 

0 

AEHF-3  (Advanced  Extremely  High  Frequency  sate... 

GEO 

1306.29 

0.001295 

1 

AEHF-2  (Advanced  Extremely  High  Frequency  sate... 

GEO 

1306.29 

0.001295 

1 

2 

DSP  20  (USA  149)  (Defense  Support  Program) 

GEO 

142.08 

0.002638 

1 

3 

Intelsat  903 

GEO 

1436.16 

0.003249 

4 

BSAT-3B 

GEO 

1365.61 

0.003418 

5 

Intelsat  902 

GEO 

1436.10 

0.003443 

6 

SDS  III-6  (Satellite  Data  System)  NRO  L-27,  Gr... 

GEO 

14.36 

0.003735 

7 

Advanced  Orion  6  (NRO  L-15,  USA  237) 

GEO 

23.94 

0.003863 

8 

SDS  III-7  (Satellite  Data  System)  NRO  L-38,  Dr... 

GEO 

23.94 

0.003863 

9 

QZS-1  (Quazi-Zenith  Satellite  System,  Michibiki) 

GEO 

1436.00 

0.004522 

Our  explanation: 
off  by  lOx; 
not  an  outlier 


Our  explanation: 
Meant  24  hours 
not  24  minutes 


Figure  51.  An  example  query  in  BayesDB 's  Bayesian  Query  Language  (BQL)  that  identifies  probably  anomalous  satellites,  along 
with  example  results. 


Traditional  databases  protect  consumers  of  data  from  “having  to  know  how  the  data  is  organized 
in  the  machine”  Codd  (1970)  and  provide  automated  data  representations  and  retrieval 
algorithms  that  perform  well  enough  for  a  broad  class  of  applications.  Although  this  abstraction 
barrier  is  only  imperfectly  achieved,  it  has  proved  useful  enough  to  serve  as  the  basis  of  multiple 
generations  of  software  and  data  systems.  This  decoupling  of  task  specification  from 
implementation  made  it  possible  to  improve  performance  and  reliability  —  of  individual 
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query  runtime,  new  (seconds) 


database  indexes,  and  in  some  cases  of  entire  database  systems  —  without  needing  to  notify  end 
users.  It  also  created  a  simple  conceptual  vocabulary  and  query  language  for  data  management 
and  data  processing  that  spread  far  farther  than  the  systems  programming  knowledge  needed  to 
implement  it. 


INFER  apoge€_km 
USING  %d  SAMPLES 
FROM  satellites  p 
WHERE  rowid  s  7 


INFER  EXPLICIT  launch  mass  kg. 
PREDICT  dry_mass_kg  USING  %6  SAMPLES 
FROM  satellites  p 

WHERE  launch  mass  kg  IS  NOT  NULL 
AND  dry” mass'kg  IS  NULL 


query  runtime,  old  (seconds) 


SIMULATE  class_of_orbit 
FROM  satelTites_p 
GIVEN  anticipated  lifetime  =  3 
LIMIT  %d 


SIMULATE  country_of_operator,  purpose 
FROM'satellites_p 
GIVEN  class_of  orbit  =  ’LEO* 
LIMIT’'%d 


SIMULATE  users 
FROM  satellites_p 
GIVEN  anticipated  lifetime  =  11 
LIMIT  %d 


10»  10^ 
query  runtime,  old  (seconds) 


10* 


query  runtime,  old  (seconds) 


query  runtime,  old  (seconds) 


10° 


Figure  52.  Query  performance  results  on  the  Satellites  database,  showing  improvement  in  runtime  for  multiple  queries  between 
our  first  submission  and  second  submission.  These  results  quantitatively  show  that  we  were  able  to  significantly  improve  the 
speed  of  BayesDB  over  the  course  of  the  PPAML  program. 


4.5  Example  results  showing  how  probabilistic  programming  can  be  used  to 
reimplement  the  "Automatic  Statistician"  in  under  70  lines  of  code. 


Another  challenge  problem  we  focused  on,  described  in  our  initial  proposal,  is  showing  that 
probabilistic  programming  can  be  used  to  re-implement  state-of-the-art  techniques  for 
discovering  symbolic  structure  in  data.  The  system  we  chose  to  reimplement  is  called  the 
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Automated  Statistician.  This  system  was  developed  at  Cambridge  University  is  described  as 
follows  on  its  website: 

“Making  sense  of  data  is  one  of  the  great  challenges  of  the  information  age  we  live  in.  While  it  is 
becoming  easier  to  collect  and  store  all  kinds  of  data,  from  personal  medical  data,  to  scientific 
data,  to  public  data,  and  commercial  data,  there  are  relatively  few  people  trained  in  the  statistical 
and  machine  learning  methods  required  to  test  hypotheses,  make  predictions,  and  otherwise  create 
interpretable  knowledge  from  this  data.  The  Automatic  Statistician  project  aims  to  build  an 
artificial  intelligence  for  data  science,  helping  people  make  sense  of  their  data. 

Kevin  P.  Murphy,  Senior  Research  Scientist  at  Google  says:  "In  recent  years,  machine  learning 
has  made  tremendous  progress  in  developing  models  that  can  accurately  predict  future  data. 
However,  there  are  still  several  obstacles  in  the  way  of  its  more  widespread  use  in  the  data 
sciences.  The  first  problem  is  that  current  Machine  Learning  (ML)  methods  still  require 
considerable  human  expertise  in  devising  appropriate  features  and  models.  The  second  problem 
is  that  the  output  of  current  methods,  while  accurate,  is  often  hard  to  understand,  which  makes  it 
hard  to  trust.  The  "automatic  statistician"  project  from  Cambridge  aims  to  address  both  problems, 
by  using  Bayesian  model  selection  strategies  to  automatically  choose  good  models  / features,  and 
to  interpret  the  resulting  fit  in  easy-to-understand  ways,  in  terms  of  human  readable,  automatically 
generated  reports. " 

The  figure  below  gives  an  example  input  and  output  from  the  Automated  Statistician,  applied  to  a 
database  of  airline  passenger  counts  spanning  multiple  years. 


200  ... 

1»50  1852  1954  1850  IS  *00 

200  'OO 

1850  1852  1854  181 

60 


100  ....... 

1850  1852  1854  1850  1950  1800  1802 


Figure  53.  The  passenger  counts  can  be  decomposed  into  a  linearly  increasing  trend,  a  periodic  overlay  (with  amplitude  that 
increases  over  time),  and  a  set  of  local  stochastic  deviations  from  the  main  trend. 
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The  core  probabilistic  inference  technique  behind  these  AI  capabilities  are  described  as  follows 
in  (Duvenaud  et  ah,  2013),  the  original  paper  that  led  to  the  Automatic  Statistician: 

Abstract:  Despite  its  importance,  choosing  the  structural  form  of  the  kernel  in  nonparametric 
regression  remains  a  black  art.  We  define  a  space  of  kernel  structures  which  are  built 
compositionally  by  adding  and  multiplying  a  small  number  of  base  kernels.  We  present  a  method 
for  searching  over  this  space  of  structures  which  mirrors  the  scientific  discovery  process.  The 
learned  structures  can  often  decompose  functions  into  interpretable  components  and  enable 
long-range  extrapolation  on  time-series  datasets.  Our  structure  search  method  outperforms 
many  widely  used  kernels  and  kernel  combination  methods  on  a  variety  of  prediction  tasks. 

Over  the  course  of  PPAML,  we  showed  how  to  reimplement  and  extend  the  Automatic 
Statistician  using  the  Venture  probabilistic  programming  language.  The  key  technical  ideas  are 
to  (i)  represent  models  using  abstract  syntax  trees  for  a  domain-specific  probabilistic  language, 
and  (ii)  represent  the  time  series  model  prior,  likelihood,  and  search  strategy  using  probabilistic 
programs  in  a  sufficiently  expressive  language.  The  final  probabilistic  program  is  written  in 
under  70  lines  of  probabilistic  code  in  Venture.  Below,  we  demonstrate  an  application  to  time 
series  clustering  that  involves  a  non-parametric  extension  to  ABCD,  experiments  for 
interpolation  and  extrapolation  on  real-world  econometric  data,  and  improvements  in  accuracy 
over  both  non-parametric  and  standard  regression  baselines. 

We  formulated  structure  discovery  as  a  form  of  “probabilistic  program  synthesis”.  The  key  idea 
is  to  represent  probabilistic  models  using  abstract  syntax  trees  (ASTs)  for  a  domain-specific 
language,  and  then  use  probabilistic  programs  to  specify  the  AST  prior,  model  likelihood,  and 
search  strategy  over  models  given  observed  data. 

Several  recent  projects  have  applied  probabilistic  programming  techniques  to  Gaussian  process 
time  series.  Schaechtle  et  al.  (2015)  embed  GPs  into  Venture  with  fully  Bayesian  learning  over  a 
limited  class  of  covariance  structures  with  a  heuristic  prior.  Tong  and  Choi  (2016)  describe  a 
technique  for  learning  GP  covariance  structures  using  a  relational  variant  of  ABCD,  and  then 
compile  the  models  into  Stan  (2017).  However,  probabilistic  programming  is  only  used  for 
prediction,  not  for  structure  learning  or  for  hyperparameter  inference. 

The  contributions  of  the  ABCD  approach  we  took  are  as  follows.  First,  we  formulate  ABCD  as 
probabilistic  program  synthesis.  Second,  our  implementation  supports  combinations  of  gradient- 
based  search  for  hyperparameters,  and  Metropolis-Hastings  sampling  for  structure  and 
hyperparameters.  Third,  we  showed  competitive  performance  on  extrapolation  and  interpolation 
tasks  from  real-world  data  against  several  baselines.  Fourth,  we  showed  that  10  lines  of  code  are 
sufficient  to  extend  ABCD  into  a  nonparametric  Bayesian  clustering  technique  that  identifies 
time  series  which  share  covariance  structure. 
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Data  structure 
representing 
probabilistic  program 


Probabilistic  program 
impiementing 
Gaussian  process 
modei 


Three  simulated 
realizations 


assume  covariance_kemel  = 
gp_cov_cp( 

4.5,  .1. 
gp_cov_product ( 
gp_cov_linear( . 36) , 
gp_cov_scale( 

2.05,  gp_cov_delta)) , 
gp_cov_scale( 

1 1 ,  gp_cov_delta) ) ; 
assume  gp  =  makG_gp( 
gp_mean_const(0 . ) , 
covariance_kernel) ; 


Figure  54.  Executing  synthesized  model  programs  to  produce  Gaussian  process  datasets.  Left:  A  symbolic  structure  generated  by 
the  AST  prior.  Center:  Equivalent  source  code  of  the  Venture  GP  model  program,  produced  by  the  AST  interpreter.  Right: 
Executions  of  the  model  program  probed  with  inputs  in  the  region  [0,  10],  which  outputs  datasets  of  GP  time  series.  (Schaechtle 
et  al.,  2017). 
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1 

2 

3 

4 

5 
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7 


Choosing  kernel  types  I 

10 

11 

12 

13 

14 

15 

16 

17 

18 

19 

20 

Choosing  composition  22 
rules  24 

25 

26 

27 

28 

29 

30 

31 

32 

Generating  34 
probabilistic  program  3^ 
source  code  H 

39 


assume  tree_root  =  ()  ->  {1}; 

assume  get_hyper_prior  ~  memC (node.index)  ->  { 

//  Gradient-safe  exponential  prior. 

-log_logistic(log_odds_uniform()  fhypers :node_index) 

}): 

assume  choose_primitive  =  (node)  ->  { 
base_kemel  ~  categorical (simplex(. 2,  .2,  .2,  .2.  .2). 

["WN",  ’'C".  "LIN",  "SE".  "PER"])  #structure:pair("base_kemel" ,  node); 
condC 

(base_kemel  ==  "WN")  (["WN",  get_hyper_prior(pair("WN",  node))]), 

(basejcemel  ==  "C")  (["C",  get_hyper_prior(pair("C" ,  node))]), 

(basejcemel  ==  "LIN")  (["LIN",  get_hyper_prior(pair("LIN" ,  node))]), 
(base_kemel  ==  "SE")  (["SE",  .81  +  get_hyper_prior(pair("SE" ,  node))]), 

(basejcemel  ==  "PER")  (["PER", 

.81  +  get Jiyper_prior (pair ("PER_1",  node)), 

.81  +  get Jiyper .prior (pair("PER_t",  node)) 

])) 


assume  choose.operator  =  mem((node)  ->  { 
operator.symbol  ~  categorical(si]q;}lex(8.45,  8.45,  8.1), 

["+",  "*",  "OP"])  #structure:pair("operator" ,  node); 
if  (operator.symbol  =  "CP")  { 

[operator.symbol ,  getJiyper_prior(pair("CP" ,  node))] 

}  else  { 
operator.symbol 

} 

}); 

assume  generate_randora_prograin  =  mem((node)  ->  { 
if  (flip(.3)  #structure : pair ( "branch" ,  node))  { 
operator  -  c±oose_operator(node) ; 

[operator,  generate_random_program(2  *  node),  genera te_random_prograffl( 2  *  node  +  1)] 
}  else  { 

choose_primitive(node) 

} 

}); 


Eigure  55.  Synthesis  model:  AST  prior  G  -  Structure  discovery  in  time  series  via  probabilistic  program  synthesis.  (Schaechtle  et 
al.,  2017).This  fragment  of  probabilistic  code  shows  how  to  define  a  generative  model  over  probabilistic  programs. 
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Defining  the 
model  and 
loading  data 


assume  source  ~  generate_randoni_program(tree_root()) ; 
assume  gp_executable  =  produce_executable ( source) ; 
define  xs  =  get_data_xs("./data.csv") ; 
define  ys  =  get_data_ys(" ./data.csv") ; 
observe  gp_executable(${xs})  =  ys; 


Custom 

inference 

strategy 


for_each(arange(T) ,  (_)  ->  { 
infer  gradient ( 

minimal_subproblem(/?hypers) ,  steps=lQ0) ; 
infer  resimulate ( 

minimal_subproblem(one (/?structure) ) ,  steps=10®) }) 


Figure  56.  Source  code  for  the  Automatic  Statistician  in  Venture:  data  observation  program  (top)  and  synthesis  inference 
strategy:  MH  +  Gradients  (bottom)  (Schaechtle  et  ah,  2017).  This  code  shows  how  to  load  data  and  use  a  custom  inference 
strategy  to  solve  the  problem  of  discovering  symbolic  structure  in  time  series  data. 
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Figure  57.  Qualitative  results  for  extrapolation  and  interpolation  tasks.  The  figure  above  shows  extrapolation  performance  on  a 
dataset  of  airline  passenger  volume  between  1949  and  1960.  The  probabilistic  program  detects  the  linear  trend  with  periodic 
variation,  leading  to  very  accurate  predictions.  In  contrast,  Bayesian  linear  regression  is  forced  to  treat  such  structural  effects  as 
unmodeled  noise_(Schaechtle  et  ah,  2017). 
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%%venturescript 
sample  source 
[  •  +  ’> 

[‘WN',  15.8436], 

[  •  +  ■> 

['PER',  5.8919,  1.9472], 
['LIN',  6.7083]]] 


WN(15.8) 


PER(5.9,  1.9) 


LIN(6.7) 


%%venturescript 
sample  source 
[ 

[■SE*,  2.2730], 

[ 

[‘PER’,  6.5189,  6.4565], 

[  ■  +  ’> 

[•WN',  11.4031], 

['PER',  1.3393,  8.3904]]]] 


PKR(6.5,  f>.5) 


PER(  1.3.  8.4) 


%%venturescript 
sample  source 
[  ['CP',  9.4905], 

['WN',  13.4885], 
[’LIN’,  5.3516]] 


WN(I3.5) 


LIN(5.4) 


—  GP  samples 
»»ii  Observations 

.«■  Held-out  data 

5. 

*  , 

*  “«  « — 

1950  1952  1954  1956  1958  1960  1962 

Year 


i  -200 


—  GP  samples 
««i«  Observations 

«■«  Held-out  data 

*  n 

*  ^  ^  : 
jfc  all  - 

1 


1950  1952  1954  1956  1958  1960  1962 

Year 


—  GP  samples 
Observations 
««i.  HeM-out  data 


iVV^ 


Figure  58.  Samples  from  prior  for  Airline  example.  Top:  Domain  specific  language  representation  of  three  symbolic  structures 
generated  by  the  AST  prior.  Middle:  tree-representation  of  said  symbolic  structures.  Bottom:  executions  of  the  model  program 
(green)  probed  with  in  regimes  where  data  was  observed  (black  data  points)  and  in  regimes  where  data  was  held  out  (red  data 
points).  The  executions  of  the  model  programs  resulting  from  the  three  different  symbolic  structures  model  different  possible 
dynamics  in  the  data.  The  left  program  represents  a  linear  trend  with  a  smooth  and  slightly  periodic  component  plus  small  white 
noise.  The  center  programs  models  a  smooth  curve.  The  right  program  implements  a  changepoint  model. 


Figure  59.  Samples  for  approximate  posterior  for  Airline  example.  Top:  Domain  specific  language  representation  of  three 
symbolic  structures  generated  by  the  approximate  AST  posterior.  Middle:  tree-representation  of  said  symbolic  structures. 
Bottom:  executions  of  the  model  program  (green)  probed  with  in  regimes  where  data  was  observed  (black  data  points)  and  in 
regimes  where  data  was  held  out  (red  data  points).  A  comparison  of  the  executions  with  observed  and  held  out  data  indicates 
good  inter-  and  extrapolation  performance  for  the  resulting  programs  given  all  of  the  three  sampled  structures. 
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Probabilistic  program  synthesis 

ABCD 

Main  system 

69 

4,166 

Gaussian  process  libraries 

2,164  (Venture  gpmem) 

13,945  (GPML  Toolbox) 

Generic  inference  implementation 

1,887  (Venture) 

— 

Figure  60.  Our  probabilistic  program  implements  functionality  from  the  “Automated  Statistician"  but  using  ~100x  fewer  lines  of 
code.  (Schaechtle  et  al.,  2017). 


Above,  we  have  described  and  implemented  a  framework  for  time  series  structure  discovery 
using  probabilistic  program  synthesis.  We  also  have  assessed  efficacy  of  the  approach  on 
synthetic  and  real-world  experiments,  and  demonstrated  improvements  in  model  discovery, 
extensibility,  and  predictive  accuracy.  It  seems  promising  to  apply  probabilistic  program 
synthesis  to  several  other  settings,  such  as  fully-Bayesian  search  in  compositional  generative 
grammars  for  other  model  classes  (Grosse  et  al.,  2012),  or  Bayes  net  structure  learning  with 
structured  priors  (Mansinghka  et  al.,  2006).  We  hope  the  formalisms  in  herein  encourage  broader 
use  of  probabilistic  programming  techniques  to  learn  symbolic  structures  in  other  applied 
domains. 

4.6  Example  results  showing  how  probabilistic  programming  can  be  used  to  solve 
problems  of  visual  scene  understanding. 

Recent  progress  on  probabilistic  modeling  and  statistical  learning,  coupled  with  the  availability 
of  large  training  datasets,  has  led  to  remarkable  progress  in  computer  vision.  Generative 
probabilistic  models,  or  “analysis-by-synthesis”  approaches,  can  capture  rich  scene  structure  but 
have  been  less  widely  applied  than  their  discriminative  counterparts,  as  they  often  require 
considerable  problem-specific  engineering  in  modeling  and  inference,  and  inference  is  typically 
seen  as  requiring  slow,  hypothesize-and-test  Monte  Carlo  methods.  In  this  section,  we  describe 
how  Picture  (Kulkarni  et  al,  2015[1]),  a  probabilistic  programming  language  for  scene 
understanding  that  allows  researchers  to  express  complex  generative  vision  models,  while 
automatically  solving  them  using  fast  general-purpose  inference  machinery. 

Picture  provides  a  stochastic  scene  language  that  can  express  generative  models  for  arbitrary 
2D/3D  scenes,  as  well  as  a  hierarchy  of  representation  layers  for  comparing  scene  hypotheses 
with  observed  images  by  matching  not  simply  pixels,  but  also  more  abstract  features  (e.g., 
contours,  deep  neural  network  activations).  Inference  can  flexibly  integrate  advanced  Monte 
Carlo  strategies  with  fast  bottom-up  data-driven  methods.  Thus  both  representations  and 
inference  strategies  can  build  directly  on  progress  in  discriminatively  trained  systems  to  make 
generative  vision  more  robust  and  efficient.  We  used  Picture  to  write  programs  for  3D  face 
analysis,  3D  human  pose  estimation,  and  3D  object  reconstruction  -  each  competitive  with 
specially  engineered  baselines. 


Approved  for  Public  Release;  Distribution  Unlimited. 


Probabilistic  scene  understanding  systems  aim  to  produce  high-probability  descriptions  of  scenes 
eonditioned  on  observed  images  or  videos,  typically  either  via  discriminatively  trained  models  or 
generative  models  in  an  “analysis  by  synthesis”  framework.  Discriminative  approaches  lend 
themselves  to  fast,  bottom-up  inference  methods  and  relatively  knowledge-free,  data-intensive 
training  regimes,  and  have  been  remarkably  successful  on  many  recognition  problems 
(Felzenszwalb  et  al.,  2010;  Krizhevsky  et  al.,  2012;  LuCun  and  Bengio,  1995;  Lowe  2004). 
Generative  approaches  hold  out  the  promise  of  analyzing  complex  scenes  more  richly  and 
flexibly  (Grenander  1993;  Grenander,  Chow,  and  Keenan,  1991;  Zhao  and  Zhu,  2011;  Del  Pero 
et  al.,  2013;  Jampani  et  al.,  2014;  Loper  and  Black,  2014;  Mansinghka  et  al.,  2013[9];  Hinton, 
Osindero,  and  Teh,  2006;  Jin  and  Geman,  2006),  but  have  been  less  widely  embraeed  for  two 
main  reasons:  Inference  typically  depends  on  slower  forms  of  approximate  inference,  and  both 
model-building  and  inferenee  ean  involve  eonsiderable  problem-speeific  engineering  to  obtain 
robust  and  reliable  results.  These  factors  make  it  difficult  to  develop  simple  variations  on  state- 
of-the-art  models,  to  thoroughly  explore  the  many  possible  eombinations  of  modeling, 
representation,  and  inference  strategies,  or  to  richly  integrate  complementary  discriminative  and 
generative  modeling  approaehes  to  the  same  problem.  More  generally,  to  handle  inereasingly 
realistic  scenes,  generative  approaches  will  have  to  scale  not  just  with  respect  to  data  size  but 
also  with  respect  to  model  and  seene  eomplexity.  This  sealing  will  arguably  require  general- 
purpose  frameworks  to  compose,  extend  and  automatically  perform  inference  in  complex 
structured  generative  models  -  tools  that  for  the  most  part  do  not  yet  exist. 

One  of  the  programming  languages  developed  over  the  course  of  the  PPAML  project  was 
Picture,  which  aims  to  provide  a  common  representation  language  and  inference  engine  suitable 
for  a  broad  class  of  generative  scene  perception  problems.  We  see  probabilistic  programming  as 
key  to  realizing  the  promise  of  “vision  as  inverse  graphies”.  Generative  models  can  be 
represented  via  stochastic  code  that  samples  hypothesized  scenes  and  generates  images  given 
those  scenes.  Rich  deterministic  and  stochastic  data  structures  can  express  eomplex  3D  seenes 
that  are  difficult  to  manually  specify.  Multiple  representation  and  inference  strategies  are 
specifieally  designed  to  address  the  main  pereeived  limitations  of  generative  approaches  to 
vision.  Instead  of  requiring  photo-realistic  generative  models  with  pixel-level  matching  to 
images,  we  ean  eompare  hypothesized  scenes  to  observations  using  a  hierarehy  of  more  abstract 
image  representations  such  as  contours,  discriminatively  trained  part-based  skeletons,  or  deep 
neural  network  features.  Available  Markov  Chain  Monte  Carlo  (MCMC)  inferenee  algorithms 
include  not  only  traditional  Metropolis-Hastings,  but  also  more  advanced  techniques  for 
inference  in  high-dimensional  continuous  spaces,  such  as  elliptical  slice  sampling,  and 
Hamiltonian  Monte  Carlo  which  can  exploit  the  gradients  of  automatically  differentiable 
renderers.  These  top-down  inference  approaches  are  integrated  with  bottom-up  and  automatieally 
constructed  data-driven  proposals,  which  can  dramatically  accelerate  inference  by  eliminating 
most  of  the  “burn  in”  time  of  traditional  samplers  and  enabling  rapid  mode-switching.  We 
demonstrate  Picture  on  three  challenging  vision  problems:  inferring  the  3D  shape  and  detailed 
appearanee  of  faees,  the  3D  pose  of  articulated  human  bodies,  and  the  3D  shape  of  medially- 
symmetric  objects.  The  vast  majority  of  code  for  image  modeling  and  inference  is  reusable 
aeross  these  and  many  other  tasks.  We  show  that  Picture  yields  performanee  competitive  with 
optimized  baselines  on  each  of  these  benchmark  tasks. 
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function  PROGRAM (MU,  PC,  EV,  VERTEX_ORDER) 

#  Scene  Language:  Stochastic  Scene  Gen 

face-Dict 0 ; shape  -  (];  texture  -  (]; 

for  S  in  ("shape",  "texture"] 
for  p  in  ("nose",  "eyes",  "outline",  "lips") 
coeff  -  HvNormal (0,1,1,99) 

facets] [p]  -  MU[S] [p]+PC[S] Ip] .. (coeff .*EV(S] [p] ) 
end 
end 

shape»face[ "shape"] ( : ] ;  tex-face ( "texture" ] ( : ] ; 
camera  -  Oniforni(-1, 1, 1, 2) ;  light  -  Onifonn(-1, 1, 1, 2) 

#  Approximate  Renderer 

rendered_img»  MeshRenderer (shape, tex, light, camera) 

#  Representation  Layer 

ren_ftrs  -  getFeatures ("CNN_Conv6",  rendered_img) 

#  Comparator 

fusing  Pixel  as  Summary  Statistics 
observe (MvNormal (0, 0.01) ,  rendered_img-obs_img) 
fusing  CNN  last  conv  layer  as  Summary  Statistics 
observe (MvNormal (0, 10) ,  ren_ftrs-oba_cnn) 
end 

global  oba_img  ■  imread ("teat .png") 

global  oba.,cnn  »  getFeatures  ("CNN_Conv6",  img) 

fLoad  args  from  file 

TR  -  trace (PROGRAM, args- (MU, PC, EV,  VERTEX_ORDER] ) 

#  Data-Driven  Learning 

leam_datadriven_proposals  (TR,  100000,  "CNN  _Conv6") 
load_proposals (TR) 

#  Inference 

infer (TR,CB, 20, ("DATA-DRIVEN"]) 
infer (TR, CB, 200, ( "ELLIPTICAL" ] ) 


Figure  61.  Picture  code  [left]  for  the  3D  face  application,  along  with  a  schematic  description  of  the  dependencies  within  that 
Picture  program  [right].  (Kulkarni  et  al.,  2015. 
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Figure  62.  An  overview  of  the  inference  architecture  from  Picture,  in  which  learned  bottom-up  proposals  can  be  combined  with 
custom  search  operators.  (Kulkarni  et  al.,  2015) 


Observed  Baseline  Picture 


Figure  63.  Quantitative  and  qualitative  results  for  3D  human  pose  program. 
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We  developed  a  Picture  program  for  parsing  3D  pose  of  articulated  humans  from  single  images. 
There  has  been  notable  work  in  model-based  approaches  for  3D  human  pose  estimation,  which 
served  as  an  inspiration  for  the  program  we  describe  in  this  section.  However,  in  contrast  to 
Picture,  existing  approaches  typically  require  custom  inference  strategies  and  significant  task- 
specific  model  engineering.  The  probabilistic  code  consists  of  latent  variables  denoting  bone  and 
joints  of  an  articulated  3D  base  mesh  of  a  body.  In  our  probabilistic  code,  we  use  an  existing 
base  mesh  of  a  human  body,  defined  priors  over  bone  location  and  joints,  and  enable  the 
armature  skin-modifier  API  via  Picture's  Blender  engine  API.  The  latent  scene  Formula  in  this 
program  can  be  visualized  as  a  tree  with  the  root  node  around  the  center  of  the  mesh,  and 
consists  of  bone  location  variables,  bone  rotation  variables  and  camera  parameters.  The 
representation  layer  Formula  in  this  program  uses  fine-grained  image  contours  and  the 
comparator  is  expressed  as  the  probabilistic  chamfer  distance. 

We  evaluated  our  program  on  a  dataset  of  humans  performing  a  variety  of  poses,  which  was 
aggregated  from  KTH  and  LabelMe  images  with  significant  occlusion  in  the  “person 
sittingx(around  50  total  images).  This  dataset  was  chosen  to  highlight  the  distinctive  value  of  a 
graphics  model-based  approach,  emphasizing  certain  dimensions  of  task  difficulty  while 
minimizing  others:  While  graphics  simulators  for  articulated  bodies  can  represent  arbitrarily 
complex  body  configurations,  they  are  limited  with  respect  to  fine-grained  appearance  (e.g.,  skin 
and  clothing),  and  fast  methods  for  fine-grained  contour  detection  currently  work  well  only  in 
low  clutter  environments.  We  initially  used  only  single-site  MH  proposals,  although  blocked 
proposals  or  HMC  can  somewhat  accelerate  inference. 

We  compared  this  approach  with  the  discriminatively  trained  Deformable  Parts  Model  (DPM) 
for  pose  estimation(referred  as  DPM-pose),  which  is  notably  a  2D  pose  model.  Images  with 
people  sitting  and  heavy  occlusion  are  very  hard  for  the  discriminative  model  to  get  right  - 
mainly  due  to  “missing”  observation  signal-while  our  model-based  approach  can  handle  these 
reasonably  if  we  constrain  the  knee  parameters  to  bend  only  in  natural  ways  in  the  prior.  Most  of 
our  model's  failure  cases,  are  in  inferring  the  arm  position;  this  is  typically  due  to  noisy  and  low 
quality  feature  maps  around  the  arm  area  due  to  its  small  size. 
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Figure  64.  Bottom:  Random  program  traces  sampled  from  the  prior  during  training.The  colored  stick  figures  are  the  results  of 
applying  DPM  pose  model  on  the  hallucinated  data  from  the  program.  (Kulkarni  et  al.,  2015)  This  illustrates  the  process  by 
which  Picture  programs  can  learn  bottom-up  proposals. 


Inout  imaae 


Three  best  proposals 


Figure  65.  A  detailed  illustration  of  the  utility  of  learned  bottom-up  proposals  for  inference  in  Picture  programs.  The  top  input 
image,  as  well  as  the  three  best  samples  drawn  from  the  learned  bottom-up  proposals  conditioned  on  the  test  image  are 
semantically  close  to  the  test  image  and  results  are  fine-tuned  by  top-down  inference  to  close  the  gap.  As  shown  on  the  log-1  plot, 
we  run  about  100  independent  chains  with  and  without  the  learned  proposal.  Inference  with  a  mixture  kernel  of  learned  bottom- 
up  proposals  and  single-site  MH  consistently  outperforms  baseline  in  terms  of  both  speed  and  accuracy.  (Kulkarni  et  al.,  2015) 
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Figure  66.  Left,  random  samples;  middle,  training  and  the  use  of  the  generative  model;  right.  Reconstructions  after  MCMC 
iterations,  (b)  The  average  and  individual  log  likelihood  scores  arising  from  randomly  initialized  96  different  chains  vs.  the 
recognition  model  initialized  96  chains.  The  recognition  model  initialized  chains  converge  fast  in  less  than  20  MCMC  sweeps, 
and  the  variability  across  chains  becomes  much  smaller.  (Yildrim  et  al.,  2015). 


Observed  Inferred 
Image  (reconstruction) 


Inferred  model 
re-rendered  with 
novel  poses 


Inferred  model 
re-rendered  with 
novel  lighting 


Figure  67.  Picture  inference  on  representative  faces,  answering  the  question  "what  does  a  given  face  probably  look  like  when 
rotated  or  lit  differently?".  The  first  column  shows  the  input  images.  The  remaining  columns  show  renderings  of  the  inferred  3D 
models  from  a  ~50-line  probabilistic  program  written  in  Picture.  This  example  shows  that  a  short  probabilistic  program  is 
applicable  to  non-frontal  faces  and  provides  reasonable  parses,  using  only  general-purpose  inference  machinery  built  into  Picture. 
(Kulkarni  et  al.,  2015) 
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There  are  many  promising  directions  for  future  research  in  probabilistic  graphics  programming. 
Introducing  a  dependency  tracking  mechanism  could  let  us  exploit  the  many  conditional 
independencies  in  rendering  for  more  efficient  parallel  inference.  Automatic  particle-filter  based 
inference  schemes  (Wood,  van  de  Meent,  and  Mansinghka,  2014;  Kulkarni,  Saeedi,  and 
Gershman,  2014)  could  extend  the  approach  to  image  sequences.  Better  illumination  (Zivanov  et 
al,  2013),  texture  and  shading  models  could  let  us  work  with  more  natural  scenes.  Procedural 
graphics  techniques  (Averkiou  et  al.,  2014;  Fish  et  al.,  2013)  would  support  far  more  complex 
object  and  scene  models  (Zhang  et  al.,  2014;  Gupta,  Efros,  and  Hebert,  2010;  del  Pero  et  al., 
2013;  Hedau,  Hoiem,  and  Forsyth,  2010).  Flexible  scene  generator  libraries  will  be  essential  in 
scaling  up  to  the  full  range  of  scenes  people  can  interpret. 

We  are  also  interested  in  further  extending  Picture  by  taking  insights  from  learning  based 
“analysis-by-synthesis”  approaches  such  as  transforming  auto-encoders  (Hinton,  Krizhevsky, 
and  Wang,  2011),  capsule  networks  (Tielman  2014)  and  deep  convolutional  inverse  graphics 
network  (Kulkarni  et  al.,  2015).  These  models  learn  an  implicit  graphics  engine  in  an  encoder- 
decoder  style  architecture.  With  probabilistic  programming,  the  space  of  decoders  need  not  be 
restricted  to  neural  networks  and  could  consist  of  arbitrary  probabilistic  graphics  programs  with 
internal  parameters. 

The  recent  renewal  of  interest  in  inverse  graphics  approaches  to  vision  has  motivated  a  number 
of  new  modeling  and  inference  tools.  Each  addresses  a  different  facet  of  the  general  problem. 
Earlier  formulations  of  probabilistic  graphics  programming  provided  compositional  languages 
for  scene  modeling  and  a  flexible  template  for  automatic  inference.  Differentiable  renderers 
make  it  easier  to  fine-tune  the  numerical  parameters  of  high-dimensional  scene  models.  Data- 
driven  proposal  schemes  suggest  a  way  to  rapidly  identify  plausible  scene  elements,  avoiding  the 
slow  bum-in  and  mixing  times  of  top-down  MCMC-based  inference  in  generative  models.  Deep 
neural  networks,  deformable  parts  models  and  other  discriminative  learning  methods  can  be  used 
to  automatically  constmct  good  representation  layers  or  similarity  metrics  for  comparing 
hypothesized  scenes  to  observed  images.  Here  we  have  shown  that  by  integrating  all  of  these 
ideas  into  a  single  probabilistic  language  and  inference  framework,  it  may  be  feasible  to  begin 
scaling  up  inverse  graphics  to  a  range  of  real-world  vision  problems. 
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5.0  CONCLUSION 

Our  research  in  this  program  led  to  the  creation  of  multiple  open-source  probabilistic 
programming  languages  and  early  demonstrations  of  their  usefulness  on  a  broad  class  of 
problems.  Specifically,  we  have  shown  that  probabilistic  programming  can  (i)  reduce  the  lines  of 
code  required  to  build  state-of-the-art  machine  learning  systems  by  ~50x;  (ii)  make  machine 
learning  and  data  science  capabilities  accessible  to  a  broader  class  of  programmers,  by  providing 
automatic  model  discovery  mechanisms  and  simple,  SQL-like  query  languages;  (iii)  make  it 
possible  to  deploy  rich  generative  models  to  solve  applied  problems,  and  thereby  solve  hard  3D 
computer  vision  problems  with  no  training  data,  and  (iv)  reveal  interfaces  and  abstractions  that 
unify  a  broad  set  of  probabilistic  programming  languages  and  enable  multiple  inference 
strategies  or  "solvers"  to  interoperate.  Our  research  has  also  helped  to  create  the  technical 
foundations  for  new  collaborations  between  the  machine  learning  and  programming  language 
research  communities. 

Probabilistic  programming  is  in  its  early  days.  There  have  been  successful  early  applications  in 
Bayesian  data  analysis  and  in  probabilistic  AI  research.  However,  almost  all  probabilistic 
programs  are  still  under  50  lines  of  probabilistic  code,  and  written  by  a  small  group  of  people, 
most  of  whom  only  know  how  to  use  a  single  probabilistic  programming  language.  Significant 
performance  and  usability  challenges  must  be  overcome  before  larger  probabilistic  programs  will 
be  practical  to  write  and  before  broader  adoption  will  be  possible. 

However,  there  are  already  new  projects  attempting  to  address  these  limitations.  For  example, 
Reid  Hoffman  (the  founder  of  Linkedin)  recently  agreed  to  give  a  $1M  gift  to  researchers  from 
this  PPAML  team,  to  support  the  continued  development  of  PPAML  technology.  A  specific  goal 
of  this  project  is  to  improve  usability,  so  that  PPAML  software  can  be  applied  to  improve  the 
integrity  and  health  of  the  democratic  process  in  the  United  States. 

There  are  also  early  indications  of  potential  use  cases  of  PPAML  technology  within  the  defense 
and  intelligence  communities.  One  key  focus  area  we  believe  is  ready  for  additional  research  and 
development  is  "data-starved"  artificial  intelligence  domains,  such  as  visual  scene  understanding 
for  ISR  and  autonomous  systems  applications.  Another  is  in  reducing  the  development  cost, 
development  time,  and  maintenance  cost  and  complexity  for  data  science  applications.  Both  of 
these  use  cases  will  require  additional  research  and  development.  A  third  is  in  Al-assisted 
scientific  discovery.  For  example,  the  probabilistic  programming  languages  developed  by  our 
team  are  playing  an  important  role  in  DARPA's  new  Synergistic  Discovery  and  Design  (SD2) 
program,  supporting  Al-assisted  scientific  discovery  and  robust  design  for  problem  domains  such 
as  synthetic  biology. 
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Application  Programming  Interface 
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Abstraet  Syntax  Tree 

BQL 

Bayesian  Query  Language 

CGPM 

Compositional  Generative  Population  Model 

DPM 

Deformable  Parts  Model 

MCMC 

Markov  Chain  Monte  Carlo 

MML 

Meta-modeling  Language 

NIPS 

Neural  Information  Processing  Systems 
Probabilistie  Programming  for  Advaneing  Maehine 

PPAML 

Learning 

SD2 

Synergistic  Discovery  and  Design 

TA 

Technical  Area 
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